Best approach for having unique row IDs in the whole database rather than just in one table? - mysql

I'm designing a database for a project of mine, and in the project I have many different kinds of objects.
Every object might have comments on it - which it pulls from the same comments table.
I noticed I might run into problems when two different kinds of objects have the same id: when pulling from the comments table, they will pull each other's comments.
I could solve it by adding an object_type column, but that will be harder to maintain when querying, etc.
What is the best approach to have unique row IDs across my whole database?
I noticed Facebook numbers its objects with really, really large IDs, and probably determines an object's type by taking the id mod a trillion or some other very large number.
Though that might work, are there any other options to achieve the same thing, or is relying on big enough number ranges fine?
Thanks!

You could use something like what Twitter uses for their unique IDs.
http://engineering.twitter.com/2010/06/announcing-snowflake.html
For every object you create, you will have to make some sort of API call to this service, though.

Why not tweak your concept of object_type by integrating it into the id column? For example, an ID could be a concatenation of the object type, a separator, and an ID that is unique within that type.
This approach might scale better, as a unique ID generator for the whole database might lead to a performance bottleneck.
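For illustration, here is a minimal sketch of that idea, assuming the application builds composite ids such as 'post-42' or 'photo-42' (all table and column names here are hypothetical):
CREATE TABLE comments (
  comment_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  object_id  VARCHAR(64) NOT NULL,  -- composite id, e.g. 'post-42' or 'photo-42'
  body       TEXT NOT NULL,
  KEY idx_comments_object (object_id)
);

-- Comments for post 42 and photo 42 can no longer collide:
SELECT body FROM comments WHERE object_id = 'post-42';
The trade-off is that the key is a string rather than an integer, so it is a bit larger and slower to compare.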

If you only have one database instance, you can create a new table to allocate IDs:
CREATE TABLE id_gen (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY
);
Now you can easily generate new unique IDs and use them to store your rows:
INSERT INTO id_gen () VALUES ();
INSERT INTO foo (id, x) VALUES (LAST_INSERT_ID(), 42);
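To tie this back to the original comments problem, here is a hedged sketch (table names are made up) of two object tables drawing ids from the same id_gen space, so a single comments.object_id column is unambiguous:
CREATE TABLE posts  (id BIGINT NOT NULL PRIMARY KEY, title VARCHAR(255));
CREATE TABLE photos (id BIGINT NOT NULL PRIMARY KEY, file  VARCHAR(255));

INSERT INTO id_gen () VALUES ();
INSERT INTO posts (id, title) VALUES (LAST_INSERT_ID(), 'Hello');

INSERT INTO id_gen () VALUES ();
INSERT INTO photos (id, file) VALUES (LAST_INSERT_ID(), 'cat.jpg');
-- posts.id and photos.id can never collide, so comments(object_id) needs no object_type column.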
Of course, the moment you have to shard this, you're in a bit of trouble. You could set aside a single database instance that manages this table, but then you have a single point of failure for all writes and a significant I/O bottleneck (that only grows worse if you ever have to deal with geographically disparate datacenters).
Instagram has a wonderful blog post on their ID generation scheme, which leverages PostgreSQL's awesomeness and some knowledge about their particular application to generate unique IDs across shards.
Another approach is to use UUIDs, which are extremely unlikely to exhibit collisions. You get global uniqueness for "free", with some tradeoffs:
slightly larger size: a BIGINT is 8 bytes, while a UUID is 16 bytes;
indexing pains: INSERTs are slower for unordered keys. (UUIDs are still preferable to random hashes here, since time-based UUIDs contain a timestamp segment that can be rearranged to give roughly ordered keys; a storage sketch follows.)
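As a storage sketch, and assuming MySQL 8.0+ (UUID_TO_BIN and BIN_TO_UUID do not exist in older versions; the table here is just an example), UUIDs are usually kept in a BINARY(16) column. The swap flag moves the time-ordered part of a version-1 UUID to the front, which helps insert locality:
CREATE TABLE things (
  id   BINARY(16) NOT NULL PRIMARY KEY,
  name VARCHAR(255)
);

INSERT INTO things (id, name) VALUES (UUID_TO_BIN(UUID(), 1), 'example');
SELECT BIN_TO_UUID(id, 1) AS id, name FROM things;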
Yet another approach (which was mentioned previously) is to use a scalable ID generation service such as Snowflake. (Of course, this involves installing, integrating, and maintaining said service; the feasibility of doing that is highly project-specific.)

I use tables as object classes, rows as objects, and columns as object parameters. Everything starts from the class technical name; every object has its own identifier, which is unique across the database. The object classes themselves are registered as objects in an object-classes table, and the parameters for each object class are linked to it.

Related

how to manage common information between multiple tables in databases

This is my first question on Stack Overflow. I am a full-stack developer and I work with the following stack: Java - Spring - Angular - MySQL. I am working on a side project and I have a database design question.
I have some information that is common to multiple tables, such as:
Document information (can be used initially in the FOLDER and CONTRACT tables).
Type information(tables: COURT, FOLDER, OPPONENT, ...).
Status (tables: CONTRACT, FOLDER, ...).
Address (tables: OFFICE, CLIENT, OPPONENT, COURT, ...).
To avoid repetition, and to avoid coupling the core tables with "technical" tables (information that can be used in many tables), I am thinking about merging the "technical" tables into one functional table. For example, we could have a generic DOCUMENT table with the following columns:
ID
TITLE
DESCRIPTION
CREATION_DATE
TYPE_DOCUMENT (FOLDER, CONTRACT, ...)
OBJECT_ID (Primary key of the TYPE_DOCUMENT Table)
OFFICE_ID
PATT_DATA
For example, we can retrieve the information about a document with the following query:
SELECT * FROM DOCUMENT WHERE OFFICE_ID = 'office 1 ID' AND TYPE_DOCUMENT = 'CONTRACT' AND OBJECT_ID = 'contract ID';
We can also use the following index to optimize the query:
CREATE INDEX idx_document_retrieve ON DOCUMENT (OFFICE_ID, TYPE_DOCUMENT, OBJECT_ID);
My questions are:
Is this a good design?
Is there a better way of implementing this design?
Should I just use a normal database design? For example, a folder can have many documents, so I create a folder_document table with folder_id as a foreign key, and do the same for all the tables.
Any suggestions or notes are very welcome, and thank you in advance for the help.
What you're describing sounds like you're trying to decide whether to denormalize and how much to denormalize.
The answer is: it depends on your queries. Denormalization makes it more convenient or more performant to do certain queries against your data, at the expense of making it harder or more inefficient to do other queries. It also makes it hard to keep the redundant data in sync.
So you would like to minimize the denormalization and do it only when it gives you good advantages in queries you need to be optimal.
Normalizing optimizes for data relationships. This makes a database organization that is not optimized for any specific query, but is equally well suited to all your queries, and it also has the advantage of preventing data anomalies.
Denormalization optimizes for specific queries, but at the expense of other queries. It's up to you to know which of your queries you need to prioritize, and which of your queries can suffer.
If you can't decide which of your queries deserves priority, or you can't predict whether you will have other new queries in the future, then you should stick with a normalized design.
There's no way anyone on Stack Overflow can know your queries better than you do.
Case 1: status
"Status" is usually a single value. To make it readable, you might use ENUM. If you need further info about a status, there be a separate table with PRIMARY KEY(status) with other columns about the statuses.
Case 2: address
"Address" is bulky and possibly multiple columns. (However, since the components of an "address" is rarely needed by in WHERE or ORDER BY clauses, there is rarely a good reason to have it in any form other than TEXT and with embedded newlines.
However, "addressis usually implemented as several separate fields. In this case, a separate table is a good idea. It would have a columnid MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEYand the various columns. Then, the other tables would simply refer to it with anaddress_idcolumn andJOIN` to that table when needed. This is clean and works well even if many tables have addresses.
One Caveat: When you need to change the address of some entity, be careful if you have de-dupped the addresses. It is probably better to always add a new address and waste the space for any no-longer-needed address.
Discussion
Those two cases (status and address) are perhaps the extremes. For each potentially common column, decide which makes more sense. As Bill points out, you really need to be thinking about the queries in order to get the schema 'right'. You should write the main queries before deciding on indexes other than the PRIMARY KEY. (So, I won't address your question about an index now.)
Do not use a 4-byte INT for something that is small, mostly immutable, and easier to read in its natural form (a combined sketch follows this list):
2-byte country_code (US, UK, JP, ...)
5-byte zip-code CHAR(5) CHARSET ascii; similar for 6-byte postal_code
1-byte ENUM('maybe', 'no', 'yes')
1-byte ENUM('not_specified', 'Male', 'Female', 'other'); this might not be good if you try to enumerate all the "others".
1-byte ENUM('folder', ...)
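Pulling those together into one (entirely hypothetical) table definition:
CREATE TABLE example_entity (
  id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  country_code CHAR(2) CHARACTER SET ascii,          -- 'US', 'UK', 'JP', ...
  zip_code     CHAR(5) CHARACTER SET ascii,          -- or CHAR(6)+ for postal_code
  confirmed    ENUM('maybe', 'no', 'yes'),           -- 1 byte
  doc_type     ENUM('folder', 'contract') DEFAULT NULL
);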
Your "folder" vs "document" is an example of a one-to-many relationship. Yes, it is implemented by having doc_id in the table Folders.
"many-to-many" requires an extra table for connecting the two tables.
ENUM
Some will argue against ever using ENUM. In your situation, there is no way to ensure that each table uses the same definition of, for example, doc_type. It is easy to add a new option on the end of the list, but costly to otherwise rearrange an ENUM.
ID
id (or ID) is almost universally reserved (by convention) to mean the PRIMARY KEY of a table, and it is usually (but not necessarily) AUTO_INCREMENT. Please don't violate this convention. Notice in my example above, id was the PK of the Addresses table, but called address_id in the referring table. You can optionally make a FOREIGN KEY between the two tables.

Which is better, using a central ID store or assigning IDs based on tables

In many ERP systems (local ones) I have seen that databases (generally MySQL) use a central key store (resource identity). Why is that?
That is, one special table in the database is maintained for ID generation; it has one cell (the first one) holding the number (ID) that is assigned to the next inserted tuple (i.e. common ID generation for all the tables in the same database).
Also, the details of the last inserted batch are recorded in this table; i.e. when 5 tuples are inserted into table ABC and, let's say, the last ID of an item in the batch is X, then an entry like ('ABC', X) is also inserted into this table (the central key store).
Is there any significance of this architecture?
Also, where can I find case studies of common large-scale custom-built ERP systems?
If I understand this correctly, you are asking why someone would replace IDs that are unique only within a table
TABLE clients (id_client AUTO_INCREMENT, name, address)
TABLE products (id_product AUTO_INCREMENT, name, price)
TABLE orders (id_order AUTO_INCREMENT, id_client, date)
TABLE order_details (id_order_detail AUTO_INCREMENT, id_order, id_product, amount)
with global IDs that are unique within the whole database
TABLE objects (id AUTO_INCREMENT)
TABLE clients (id_object, name, address)
TABLE products (id_object, name, price)
TABLE orders (id_object, id_object_client, date)
TABLE order_details (id_object, id_object_order, id_object_product, amount)
(Of course you could still call these IDs id_product etc. rather than id_object; I only used the name id_object for clarity. A concrete MySQL sketch of this approach follows below.)
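Here is a hedged MySQL sketch of the second approach, using the names from the outline above; the FOREIGN KEY is optional but makes the relationship explicit:
CREATE TABLE objects (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE clients (
  id_object BIGINT NOT NULL PRIMARY KEY,
  name      VARCHAR(100),
  address   VARCHAR(255),
  FOREIGN KEY (id_object) REFERENCES objects (id)
);

-- Allocate a global id, then insert the typed row:
INSERT INTO objects () VALUES ();
INSERT INTO clients (id_object, name, address)
VALUES (LAST_INSERT_ID(), 'Acme', '1 Main St');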
The first approach is the common one. When inserting a new row into a table you get the next available ID for the table. If two sessions want to insert at the same time, one must wait briefly.
The second approach hence leads to sessions waiting for their turn every time they want to insert data, no matter which table, as they all get their IDs from the objects table. The big advantage is that when exporting data, you have global references. Say you export orders and the recipient tells you: "We have problems with your order detail 12345. There must be something wrong with your data." Wouldn't it be great if you could tell them "12345 is not an order detail ID, but a product ID. Do you have problems importing the product, or can you tell me which order detail ID this is about?" rather than spending hours looking at an order detail record that happens to have the number 12345 while it had nothing to do with the issue?
That said, it might be a better choice to use the first approach and add a UUID to all tables you'd use for external communication. No fight for the next ID and still no mistaken IDs in communication :-)
This is the common strategy used in data warehouses to track the batch number after a successful or failed data load. If loading fails, you store something like 'ABC', 'Batch_num' and 'Error_Code' in the batch control table, so the subsequent loading logic can decide what to do with the failure and you can easily track the loads; if you want to recheck, you can archive the data. These IDs are usually generated by a sequence in the database. In one word, it is mostly used for monitoring purposes.
You can refer to this link for more details.
There are several more techniques, each with pros and cons. But let me start by pointing out two techniques that hit a brick wall at some point when scaling up. Let's assume you have billions of items, probably scattered across multiple servers, either by sharding or by other techniques.
Brick wall #1: UUIDs are handy because clients can create them without having to ask some central server for values. But UUIDs are very random. This means that, in most situations, each reference incurs a disk hit because the id is unlikely to be cached.
Brick wall #2: Asking a central server, which has an AUTO_INCREMENT under the covers, to dole out ids. I watched a social media site that did nothing but collect images for sharing crash because of this, in spite of there being a server whose sole purpose was to hand out numbers.
Solution #1:
Here's one (of several) solutions that avoids most problems: Have a central server that hands out 100 ids at a time. After a client uses up the 100 it has been given, it asks for a new batch. If the client crashes, some of the last 100 are "lost". Oh, well; no big deal.
That solution is upwards of 100 times as good as brick wall #2. And the ids are much less random than those for brick wall #1.
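One hedged way to implement the hand-out-100-ids idea in MySQL is the LAST_INSERT_ID(expr) trick (table and column names here are made up):
CREATE TABLE id_blocks (
  next_id BIGINT NOT NULL
);
INSERT INTO id_blocks (next_id) VALUES (1);

-- A client reserves 100 ids with a single statement; no explicit locking needed:
UPDATE id_blocks SET next_id = LAST_INSERT_ID(next_id + 100);
-- The reserved block is [LAST_INSERT_ID() - 100, LAST_INSERT_ID() - 1]:
SELECT LAST_INSERT_ID() - 100 AS block_start;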
Solution #2: Each client can generate its own 64-bit, semi-sequential ids. The number includes a version number, some of the clock, a dedup part, and the client id. So it is roughly chronological (worldwide) and guaranteed to be unique, yet items created at about the same time still have good locality of reference.
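As a rough illustration only (the exact bit layout here is an assumption, not taken from the answer), such an id could pack roughly 41 bits of milliseconds since a custom epoch, a client id, and a per-client sequence number:
SET @client_id := 42;   -- assigned once per client / app server
SET @seq := 7;          -- per-client counter, maintained by the client itself

SELECT ((CAST(UNIX_TIMESTAMP(NOW(3)) * 1000 AS UNSIGNED) - 1577836800000) << 22)
       | (@client_id << 12)
       | @seq AS new_id;   -- roughly chronological; unique if client_id and seq are managed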
Note: My techniques can be adapted for use by individual tables or as an uber-number for all tables. That distinction may have been your 'real' question. (The other Answers address that.)
The downside to such a design is that it puts a tremendous load on the central table when inserting new data. It is a built-in bottleneck.
Some "advantages" are:
Any resource id that is found anywhere in the system can be readily identified, regardless of type.
If there is any text in the table (such as a name or description), then it is all centralized, facilitating multi-lingual support.
Foreign key references can work across multiple types.
The third is not really an advantage, because it comes with a downside: the inability to specify a specific type for foreign key references.

DB Design - any way to avoid duplicating columns here?

I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created here and each hash would go into here as well as the original hash table, but in the Jobs table it'd also be stored against the job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!
Adding the new job table is the way to go. It's the normative practice for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are really defining a relationship between job and the table that has hash_value as the primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
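A hedged sketch of that arrangement, assuming the existing table is called hashes with hash_value CHAR(40) as its primary key (names and lengths are taken loosely from the question):
CREATE TABLE jobs (
  job_id     BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  job_name   VARCHAR(100) NOT NULL,
  hash_value CHAR(40) NOT NULL,
  FOREIGN KEY (hash_value) REFERENCES hashes (hash_value)
  -- InnoDB creates an index on hash_value for the FK if one does not already exist
);

-- All jobs in which a given hash occurs:
SELECT job_id, job_name FROM jobs WHERE hash_value = 'abc123...';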
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
Duplicate data and redundant data are technical terms in relational modeling. Technical term means they have meanings that you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)
If I read your question correctly, your design is fundamentally flawed, because of these facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.

To merge the table or not for performance/centralisation

I have been working on my database and the thought occurred to me that maybe it would be better to combine two of my tables to better organise the data and perhaps get performance benefits (or not?).
I have two tables that contain addresses and their structure is identical: one table contains invoice addresses and the other contains delivery addresses.
What would be the implications of merging these together into one table simply called "addresses", and create a new column called addressTypeId? This new column references a new table that contains address types like delivery, invoice, home etc.
Is keeping them separate, as they are now, better for performance? Requests for the different types of addresses (delivery and invoice) would then use two tables instead of one, which might avoid delays when requesting address data.
By the way I am using INNODB.
If you are missing the appropriate indexes, then lookup performance will drop by a factor of two (assuming you are merging two equally sized tables). However, if you are missing indexes, you likely don't care about performance anyway.
Lookup using a hashed index is constant-time. Lookup using a tree index is logarithmic, so the effect is small. Writes to a tree index are logarithmic as well, and writes to a hash map are amortized constant-time.
don't suffer from premature optimization!!!
A good design is more important than peak performance. Address lookup is likely not your bottleneck. Bad code resulting from a bad database design far outweighs any benefits. If you make two tables, you are going to duplicate code, and code duplication is a maintenance nightmare.
Merge the tables. You will be thankful when you need to extend your application in the near future. You may want to add more address types. You may want to add common functionality to the addresses (formatting). Your customers will not notice the extra millisecond from traversing one more level of a binary tree. They will notice that you have a hard time adding an extra feature, and they will notice inconsistencies arising from code duplication.
You might even gain performance by merging the tables. While you might need to traverse an extra node in a tree, the tree might be more likely to be cached in memory and not need disk access. Disk access is expensive. You might reduce disk access by merging.
As @BenP.P.Tung already said, you don't need an extra table for an enumeration. Use an enumeration type.
If you just need to distinguish the address types, I suggest an ENUM column in this merged table. If the table already exists, you can add the new column as follows (assuming the merged table is called addresses, as proposed in the question):
ALTER TABLE addresses ADD COLUMN addressTypes ENUM('delivery','invoice','home') DEFAULT NULL;
Or use DEFAULT 'invoice', or whatever you think should be the default when you cannot get the required information.
You don't need to define all ENUM values at once. Just add what you need now and add more values in the future, as follows:
ALTER TABLE addresses CHANGE addressTypes addressTypes ENUM('delivery','invoice','home','office') DEFAULT NULL;
One table will work fine. If there is a performance concern, then add the address type column to the primary index at the start of the index. This will avoid any performance issues until you have a very large number of addresses.
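For example (a sketch only, with column names assumed from the question), the merged table's primary key could lead with the type:
CREATE TABLE addresses (
  addressTypeId TINYINT UNSIGNED NOT NULL,
  addressId     INT UNSIGNED NOT NULL,
  line1         VARCHAR(100),
  city          VARCHAR(50),
  PRIMARY KEY (addressTypeId, addressId)
);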
their structure is identical.
Are their constraints identical as well? [1]
If yes, merge the addresses.
If no, keep them separate.
Constraints are as much part of the table as are its fields.
Is keeping them separate, as they are now, better for performance? Requests for the different types of addresses (delivery and invoice) would then use two tables instead of one, which might avoid delays when requesting address data.
Do you query both kinds of addresses in the same way?
If yes, it shouldn't matter either way (assuming you indexed correctly).
If not, then different tables enable you to index or cluster your data differently.
Related posts:
Data modeling for Same tables with same columns
Two tables with same columns or one table with additional column?
[1] For example, are both delivery and invoice supposed to be able to reference (through foreign keys) the same address? Are the PKs of addresses supposed to be unique for all addresses or just for addresses of a particular type? Are there any CHECKs that exist for one address type and not for the other? Etc., etc...

Best primary key for storing URLs

Which is the best primary key for storing website addresses and page URLs?
To avoid using an auto-increment id (which is not really tied to the data), I designed the schema to use a SHA-1 signature of the URL as the primary key.
This approach is useful in many ways: for example, I don't need to read the last inserted id from the database, so I can prepare all table updates by calculating the key and do the real update in a single transaction. No constraint violations.
Anyway, I read two books that tell me I am wrong. "High Performance MySQL" says that a random key is not good for the DB optimizer, and in each of his books Joe Celko says the primary key should be some part of the data.
The question is: the natural key for a URL is... the URL itself. And while for a site it may be short (www.something.com), there is no imposed limit on URL length (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
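Putting that suggestion into DDL (a sketch; table and column names are assumptions):
CREATE TABLE urls (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  hash CHAR(32) NOT NULL,   -- MD5(url) in hex
  url  TEXT NOT NULL,
  UNIQUE KEY uniq_hash (hash)
);

-- Atomic "insert if new" and fast exact lookup by full text:
INSERT IGNORE INTO urls (hash, url)
VALUES (MD5('http://example.com/x'), 'http://example.com/x');
SELECT id, url FROM urls WHERE hash = MD5('http://example.com/x');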
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and it makes troubleshooting fairly obscure. I once had to use indexes on hashes to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one; it will actually work well in mySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle database product happens to have a built-in way of doing this, the so-called reverse key index.)
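MySQL has no built-in reverse key index, but assuming a urls(id, url) table like the one sketched earlier, one way to approximate the idea with a MySQL 5.7+ generated column would be:
ALTER TABLE urls
  ADD COLUMN url_rev VARCHAR(512)
      GENERATED ALWAYS AS (REVERSE(LEFT(url, 512))) STORED,
  ADD KEY idx_urls_rev (url_rev(191));

-- A trailing-anchored search becomes a leading-prefix search on url_rev:
SELECT id FROM urls WHERE url_rev LIKE CONCAT(REVERSE('/some-page.html'), '%');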
If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
It depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement id.