Imagine a relational database system (say, MySQL) that is clustered across many servers (maybe 100 servers). In this database system there is a table called "users", and "users" has a primary key (a UINT, for instance).
This user ID must be unique across all the servers. This user ID may be auto-incrementing.
So how does a distributed database system handle this kind of problem? How does an RDBMS generate an ID that is unique across all the servers?
I don't want SQL code showing how to do this in MySQL; I just want to know how it is done conceptually in such a case.
[Edit]
Both answers sound OK.
Here is another case; let's take Stack Overflow as an example. This question's URL is http://stackoverflow.com/questions/18359434. Another URL is http://stackoverflow.com/questions/18359435, which points to the question that was asked right after this one. Obviously Stack Overflow has multiple database servers, yet the IDs for questions are auto-incrementing.
So what approach is Stack Overflow using?
Stack Overflow gets a huge amount of traffic; it ranks around 100 on both Alexa and Quantcast.
The canonical solution is to use uuid() (see here) rather than an integer for such a unique identifier. This is guaranteed to be unique in space as well as time.
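For instance, a hedged sketch in MySQL (the table and column names are illustrative; UUID_TO_BIN() requires MySQL 8.0+):

    CREATE TABLE users (
      id   BINARY(16)   NOT NULL PRIMARY KEY,  -- compact storage for the UUID
      name VARCHAR(100) NOT NULL
    );

    -- The swap flag in UUID_TO_BIN() reorders the timestamp bits so the
    -- values are roughly chronological, which helps insert locality.
    INSERT INTO users (id, name)
    VALUES (UUID_TO_BIN(UUID(), 1), 'alice');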
A more "hacked" solution is to use two-part primary keys. Have the first be an identifier of "what system am I on" and the second be an auto-incremented number, unique to that system.
Another "hacked" solution is to give each system ranges. Say you are using big integers, then 1,000,000,000 might start the value on one system, 2,000,000,000 on another, and so on.
I would not recommend that you actually try to implement an auto-incremented number across a distributed system. This would basically entail having a single system that maintained the most recent number, and having the other systems ask it for the next number. However you implement this, you will introduce a bottleneck into the system.
In this case I'd use a GUID primary key and wouldn't have this issue (I'm not sure MySQL supports GUIDs natively, though).
The alternative old-fashioned way is to use primary key ranges - that is, have one instance use keys from 1,000,000 to 1,999,999, the next use the range 2,000,000 to 2,999,999, and so on, ensuring each instance cannot use the keys of another.
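A minimal sketch of seeding those ranges, assuming a MySQL users table:

    -- On instance 1:
    ALTER TABLE users AUTO_INCREMENT = 1000000;
    -- On instance 2:
    ALTER TABLE users AUTO_INCREMENT = 2000000;

Note that nothing enforces the top of each range, so the blocks need to be sized generously or monitored.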
Related
Could you give me any advice on whether it is a good idea to mix UUID primary keys and auto-incrementing integer primary keys across different tables in the same database? We want to rebuild a database that will grow over time and will work in a distributed environment. There will be one main database and many smaller databases on other machines (subsets of the main database). The smaller databases will be kept in sync with the main database.
I know that in such distributed systems a UUID is the best choice for a primary key. But, for example, the database will have tables like page_status or page_type which will not change often and will not have many rows. For performance and readability it would be simpler to have a plain integer primary key in such tables. Please let me know what you think and what your experience with this topic has been. Thanks in advance!
A UUID is the 'right' way to create a unique id when you have this requirement:
The id needs to be constructed independently by different clients.
UUIDs have these problems:
Bulky: 16 bytes per use. Note that "use" includes all secondary keys and joining tables. It adds up.
Randomness: When a table is bigger than RAM, references are slowed to disk speed.
The alternatives are:
Have a single source (eg, a database) that delivers the 'next' id when asked. This is limited in how fast the ids can be generated.
Devise a mechanism for having clients independently generate unique ids, but not based on UUIDs -- see the problems above. Example: a 64-bit integer with the time in the top bits, then a uniqueness number (within the client), then a client number; see the sketch below.
You could map UUIDs to smaller AIs (auto-increments), which are then used in various places. But this adds complexity.
Juggling the bits of a Type-1 UUID makes them roughly chronological; this avoids the randomness problem. This is discussed at http://mysql.rjweb.org/doc.php/uuid . Functions for that are built into MySQL 8.0.
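A hedged sketch of such a client-generated 64-bit id in MySQL (the bit layout of roughly 41 bits of milliseconds, 12 bits of per-client sequence, and 10 bits of client number is an assumption, not a standard):

    -- Milliseconds since the Unix epoch fit in ~41 bits until around 2039.
    SET @ms     := CAST(UNIX_TIMESTAMP(NOW(3)) * 1000 AS UNSIGNED);
    SET @seq    := 0;   -- per-client counter, wraps at 4096
    SET @client := 42;  -- this client's unique number (0..1023)

    SELECT (@ms << 22) | (@seq << 10) | @client AS next_id;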
I know very little about MySQL (or web development in general). I'm a Unity game dev, and I've got a situation where users (from a region whose size I haven't decided yet; possibly global) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
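A minimal sketch, assuming the GUID is stored compactly as 16 raw bytes (names are illustrative; the client could produce the bytes with System.Guid.ToByteArray()):

    CREATE TABLE entries (
      guid    BINARY(16) NOT NULL PRIMARY KEY,  -- indexed automatically as the PK
      payload TEXT
    );

    -- A lookup walks the B-tree index, so it stays fast on huge tables:
    SELECT payload FROM entries
    WHERE guid = UNHEX('0123456789ABCDEF0123456789ABCDEF');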
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
    SELECT ... FROM t WHERE id = 1234;
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use such, be aware that GUIDs perform poorly if the table is bigger than RAM.
I have a MySQL database with 220 tables. The database is well structured but without any explicitly declared relationships. I want to find a way to connect the primary key of each table to its corresponding foreign keys.
I was thinking of writing a script to discover the possible relation between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and get closer to the solution? Also, is there any available tool which does that?
Please advise!
Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, that it out-lives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
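For what it's worth, a sketch of such an extraction (this particular form of the query uses MySQL's information_schema; the schema name is a placeholder):

    SELECT TABLE_NAME, COLUMN_NAME,
           REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
    FROM information_schema.KEY_COLUMN_USAGE
    WHERE TABLE_SCHEMA = 'your_database'
      AND REFERENCED_TABLE_NAME IS NOT NULL;  -- only FOREIGN KEY usages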
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity.
Pretend SQL Platform
Evidently non-compliant databases such as MySQL, PostgreSQL and Oracle fraudulently position themselves as "SQL", but they do not have basic SQL functionality, such as a catalogue. I suppose you get what you pay for.
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
Solution 2
Now you could do all that in SQL, but then the code would be horrendous, and SQL is not designed for that (table comparisons). That is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straightforward. That is squarely within awk's design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see a record number 123456 being repeated in the Invoice file; now what other file does this relate to? Every other possible file (Supplier, Customer, Part, Address, CreditCard), where it occurs once only, has a record number 123456!
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table; now what other table does this relate to? The only table that has IBM occurring once is the Customer table.
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs
I've looked for a satisfying answer a tad more specific to my particular problem for a while now, but to no avail. Whether I'm just not looking in the right places or not, I don't know, but here goes:
I'm pulling data from an application; afterwards the data is manipulated and sent to my own server. Among the data pulled is an identifier that was auto-incremented in the application's original database. An example of this identifier I just retrieved is 955534861. Isn't it better and more effective design not to auto-increment my primary key and just use the value I know is, and will always stay, unique? Or should I look into concepts such as surrogate keys?
Thanks in advance.
The situation you describe resembles my primary job which is maintaining a data warehouse. We get data from other systems and store it.
Something that happens to us is that these "other systems" change. That leads to the possibility that the new version of the "other system" will duplicate the unique identifiers from the previous system. We deal with this by adding something to the record in our data warehouse to guarantee its uniqueness. It might be a field identifying the source system, or it might be a date. It is never an autogenerated number.
If there is any chance of this happening to you, you might want to expand your options.
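A minimal sketch of that pattern (names are illustrative): make the source system part of the key, so identifiers from different upstream systems can never collide.

    CREATE TABLE warehouse_users (
      source_system VARCHAR(30)     NOT NULL,  -- which upstream system sent the row
      source_id     BIGINT UNSIGNED NOT NULL,  -- the identifier as received
      loaded_at     DATE            NOT NULL,
      PRIMARY KEY (source_system, source_id)
    );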
If there is a natural key in your model, you cannot replace it by creating a surrogate key.
You can only add a surrogate key and keep the existing natural key, which has its pros and cons, as described here.
This'll get a little nerdy, but bear with me:
As long as a key value is unique, it'll serve its function. But for performance, you ideally want that key value to be as short as possible.
GUIDs are commonly used, because they are statistically highly unlikely to ever be repeated. But that comes at the expense of size: they are 128 bits long, which makes them longer than a machine word. To compare two GUIDs (as must be repeatedly done when sorting, or migrating down a b-tree for indexes) will take multiple processor instructions to load and compare the values. And they will consume more memory when cached in memory.
The advantages of auto-incrementing key values are:
They are guaranteed to be unique. Proxy index values are only predicted to be unique.
Because they will have full value coverage over the range of their underlying datatype, the most compact possible type may be used. This makes for smaller indexes and more efficient compare operations.
Because the smallest possible type can be used, more index values can be stored on a single database page, which means you're more likely to get a cache hit when searching or joining on that value. That means that performance will be--all other things being equal--somewhat better.
On most databases, auto-incrementing keys are worked into the database engine, so there is very little overhead in generating them.
If you employ a clustered index on your key value, new record inserts are less likely to require a random disk seek, and more likely to be read during read-ahead, so if you do any kind of sequential processing or lookup based on that key, it'll probably run faster.
The primary key, typically an auto-incrementing ID, is what MySQL uses as a row identifier as well, so it should be left alone. If you need a secondary key that's generated by your application for some other purpose, you may want to add that as another column with a UNIQUE index on it.
In other databases where there's a proper row identifier mechanism, this is less of an issue.
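A minimal sketch of that arrangement in MySQL (names are illustrative):

    CREATE TABLE records (
      id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      app_key BINARY(16)      NOT NULL,  -- application-generated identifier
      PRIMARY KEY (id),                  -- InnoDB clusters rows on this
      UNIQUE KEY uk_app_key (app_key)    -- secondary key for application lookups
    ) ENGINE=InnoDB;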
I can either have an auto increment id field as my primary key or a sha1 hash.
Which one should I choose?
Which would be better in terms of performance?
There are a few application-driven cases where you'd want to use a globally unique ID (UUID/GUID):
You expect to (or are) using a sharding strategy to scale writes. You don't want the shard nodes to duplicate keys.
You want to be able to safely port data from one node to another, preserving keys. This is critical if you want to keep foreign-key relationships intact.
Your application is also used offline (in-home sales, in-home repairs, etc.) where the offline application periodically syncs with the "source of truth". You'd want those offline keys to be unique without having to make a remote call. Otherwise, it is up to you to come up with a strategy to reorganize keys and relationships on the way in. With an auto-increment strategy and depending on the RDBMS you are using, this is likely a non-trivial task.
If you don't have a use-case from above or something similar, you may use an auto-increment id if that makes you comfortable; however, you may still want to consider UUID/GUID.
The Trade Off:
There are a lot of opinions held about the speed / size of UUID/GUID keys. At the end of the day, it is a trade-off and there are lots of ways to gain or lose speed with a database. Ideally, you want your indexes to be stored in RAM in order to be as fast as possible; however, that is a trade-off you have to weigh against other considerations.
Other Considerations regarding UUID/GUID:
Many RDBMS can produce a UUID.
You can also produce a UUID via your application (you aren't tied to the RDBMS to generate).
Developers / Testers can easily port data from environment to environment and have the application work as expected. This is an often overlooked use-case; however, it is one of the stronger cases for using a UUID/GUID strategy.
There are databases that are optimized for use offline (CouchDB) where UUID is what you get.
Almost definitely an auto-incrementing integer. It will be faster to create, faster to search, and smaller. Consider, for example, if you had another table that referenced it: would you want it to reference it via an integral primary key or via a SHA-1 hash? An integer would be more meaningful (in a way), and it would be much (much!) more efficient.
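To make the size difference concrete, a hedged sketch (names are illustrative): every foreign key and secondary index repeats the parent key, so an 8-byte integer beats a 20-byte SHA-1 everywhere it appears.

    CREATE TABLE posts (
      id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      title VARCHAR(200) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE comments (
      id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      post_id BIGINT UNSIGNED NOT NULL,  -- 8 bytes here; a raw SHA-1 would be 20
      body    TEXT,
      FOREIGN KEY (post_id) REFERENCES posts (id)
    ) ENGINE=InnoDB;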
Use an auto-increment id.
An ID does not have to be generated, only incremented.
Hashes are a better fit for storing passwords.
You could get duplicate keys using SHA hashes. The chance is small, but real.
An ID is way more readable.
An ID is a kind of insertion history: you know which record was inserted last (highest ID).