Documents in DocumentDB are required to have a unique id property.
Each document also has a document link. This link seems to fulfill much of the role a primary key fulfills in a relational database table, specifically referring to a unique document.
What should I be using the Id property for vs the document link? If I want to store a stable identifier to a document, are there reasons to store the ID rather than the document link?
Both id (which is settable), and _rid (which is generated by DocumentDB) are unique keys for a collection, when combined with the configured partition key value. Most applications just use id. The primary advantage of id is that it is user generated.
There are some advantages to using _rid however:
It is globally unique, yet not as long as a Guid
It is hierarchical, so you don't need to track database, collection, and document IDs - just the _rid
It is a monotonically increasing value, which is a useful property for some use cases (note: increasing within a partition key, no guaranteed ordering across partition keys)
I think you can choose either to store the link field or to store the id field and build the link in code. The latter adds only a minuscule cost in latency, so it's totally up to you which you store as a foreign key. Note, though, that hand-building links from ids often takes fewer round trips in non-foreign-key circumstances.
A bit of background... The link fields are a bit of legacy, although still useful and probably never going away. When DocumentDB first came out, you couldn't hand build your own links. They fixed this some time in late 2015 or early 2016, if memory serves. I suspect, if they had to do it all over again, they would have done everything with hand-built links using the id.
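As a rough illustration of hand-building a link from ids, here is a minimal Python sketch. It assumes the id-based link layout dbs/{database}/colls/{collection}/docs/{document}; the function name and the sample ids are hypothetical, and no SDK is involved.

```python
def build_document_link(database_id: str, collection_id: str, document_id: str) -> str:
    """Build an id-based document link from the three user-settable ids.

    Assumes the layout dbs/{db}/colls/{coll}/docs/{doc}; the ids are used
    as given and are not validated or escaped here.
    """
    parts = ("dbs", database_id, "colls", collection_id, "docs", document_id)
    return "/".join(parts)


# Hypothetical ids, used only for illustration.
link = build_document_link("mydb", "mycoll", "order-1001")
print(link)
```

Storing only the id and rebuilding the link like this keeps the stored foreign key short, at the cost of needing to know the database and collection ids at read time.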
I've looked for a satisfying answer a tad more specific to my particular problem for a while now, but to no avail. Whether I'm just not looking in the right places, I don't know, but here goes:
I'm pulling data from an application; the data is afterwards manipulated and sent to my own server. Among the data pulled is an identifier that was auto-incremented in the application's database. An example of this identifier, which I just retrieved, is 955534861. Isn't it a better and more effective design not to auto-increment my own primary key and instead use the value I know is and will always stay unique, or should I look into concepts such as surrogate keys?
Thanks in advance.
The situation you describe resembles my primary job which is maintaining a data warehouse. We get data from other systems and store it.
Something that happens to us is that these "other systems" change. That leads to the possibility that the new version of the "other system" will duplicate a unique identifier from the previous system. We deal with this by adding something to that record in our data warehouse to guarantee its uniqueness. It might be a field to identify the source system or it might be a date. It is never an autogenerated number.
If there is any chance of this happening to you, you might want to expand your options.
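One way to picture the "add a source-system field" approach is a composite primary key. The sketch below uses Python's sqlite3 module with an in-memory database; the table name, column names, and source-system labels are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE imported_record (
        source_system TEXT    NOT NULL,  -- hypothetical label for the upstream system
        source_id     INTEGER NOT NULL,  -- the identifier as received from that system
        payload       TEXT,
        PRIMARY KEY (source_system, source_id)
    )
    """
)


def import_record(source_system, source_id, payload):
    """Insert one record; return False if this (system, id) pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO imported_record VALUES (?, ?, ?)",
                (source_system, source_id, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = import_record("app_v1", 955534861, "original row")
second = import_record("app_v2", 955534861, "same id reused by a new system")
duplicate = import_record("app_v1", 955534861, "true duplicate")
print(first, second, duplicate)
```

The same incoming identifier is accepted once per source system, while a genuine duplicate within one system is rejected.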
If there is a natural key in your model, you cannot replace it by creating a surrogate key.
You can only add a surrogate key and keep the existing natural key, which has its pros and cons, as described here.
This'll get a little nerdy, but bear with me:
As long as a key value is unique, it'll serve its function. But for performance, you ideally want that key value to be as short as possible.
GUIDs are commonly used, because they are statistically highly unlikely to ever be repeated. But that comes at the expense of size: they are 128 bits long, which makes them longer than a machine word. Comparing two GUIDs (as must be done repeatedly when sorting, or when descending a B-tree index) takes multiple processor instructions to load and compare the values. And they consume more memory when cached.
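To make the size difference concrete, this small Python sketch compares the raw size of a GUID with fixed-width integer keys. It only measures key bytes; real index entries carry extra per-row overhead that depends on the database engine.

```python
import struct
import uuid

guid = uuid.uuid4()

guid_bytes = len(guid.bytes)          # raw GUID: 16 bytes (128 bits)
int32_bytes = struct.calcsize("<i")   # 32-bit integer key: 4 bytes
int64_bytes = struct.calcsize("<q")   # 64-bit integer key: 8 bytes

print("GUID bytes:", guid_bytes)
print("32-bit key bytes:", int32_bytes)
print("64-bit key bytes:", int64_bytes)
print("GUID is", guid_bytes // int32_bytes, "times the size of a 32-bit key")
```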
The advantage of auto-incrementing key values is that
They are guaranteed to be unique. Proxy index values are only predicted to be unique.
Because they will have full value coverage over the range of their underlying datatype, the most compact possible type may be used. This makes for smaller indexes and more efficient compare operations
Because the smallest possible type can be used, more index values can be stored on a single database page, which means you're more likely to get a cache hit when searching or joining on that value. That means that performance will be, all other things being equal, somewhat better.
On most databases, auto-incrementing keys are worked into the database engine, so there is very small overhead in generating them.
If you employ a clustered index on your key value, new record inserts are less likely to require a random disk seek, and more likely to be read during read-ahead, so if you do any kind of sequential processing or lookup based on that key, it'll probably run faster.
The primary key, typically an auto-incrementing ID, is what MySQL uses as a row identifier as well, so it should be left alone. If you need a secondary key that's generated by your application for some other purpose, you may want to add that as another column with a UNIQUE index on it.
In other databases where there's a proper row identifier mechanism, this is less of an issue.
I have a table titled videos. In it there are three columns: media_id, project_id, and video_url. My question is, is it necessary for me to have media_id? I'm not using it in any other tables. I would expect there to be multiple project_ids with the same number but different video_urls.
Having or not having surrogate ID's for something has nothing to do with normalization.
(copyright catcall)
Having or not having surrogate ID's for something depends on whether or not you have a real use for it. You already gave the answer to that yourself. And it depends on whether or not there is a significant likelihood that, even if there is no actual use for it right now, such a use might emerge in the near future.
You could use project_id and video_url together as the key in your model, but at a physical level I would not like to use a URL as part of a key.
By this I mean I prefer an ID or number, to avoid repeating a long string each time the key is referenced in other tables.
I would consider it necessary. This is purely based on the fact that the media entry is unique and there could be multiple media entries for any one project. This keeps a unique id for the row, a proper project relationship and the valuable URL data for the media resource.
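A minimal sketch of the two options side by side, using Python's sqlite3 module: media_id is kept as a short surrogate key, while a UNIQUE constraint on (project_id, video_url) still enforces the natural key. The column types and sample URLs are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE videos (
        media_id   INTEGER PRIMARY KEY,      -- surrogate key, assigned by the database
        project_id INTEGER NOT NULL,
        video_url  TEXT    NOT NULL,
        UNIQUE (project_id, video_url)       -- natural key is still enforced
    )
    """
)

conn.executemany(
    "INSERT INTO videos (project_id, video_url) VALUES (?, ?)",
    [
        (1, "http://example.com/a.mp4"),
        (1, "http://example.com/b.mp4"),  # same project, different URL
        (2, "http://example.com/a.mp4"),  # same URL, different project
    ],
)

rows = conn.execute(
    "SELECT media_id, project_id, video_url FROM videos ORDER BY media_id"
).fetchall()
for row in rows:
    print(row)
```

Other tables can then reference the short media_id instead of repeating the URL.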
Somewhere I read that values in select boxes (or anything else in the HTML code) should not be the primary key of the database table. For example:
<select>
<option value="1">Value 1</option>
<option value="2">Value 2</option>
</select>
In the database there are lookup tables with these values as primary key (1, 2, 3,....). So the data from the select box I store in a table which references this lookup table is a number like 1, 2, 3.... (as the value of the options fields).
I read that it is better not to use the same values in HTML and as keys, for security reasons, but what's the problem with that? I don't understand why this would be a security concern.
Sounds like security-through-obscurity, aka no security at all to me.
A good primary key in a database is purely for uniqueness in the system and shouldn't be related to the meaning of the data. If the primary key was related to the data (say people's social security numbers, stuff like that) then you've got a security issue in exposing the keys, as they are exposing information that could be used maliciously. In that case, whilst you could argue that the best approach from a technical point of view might be to change the application to stop it using those meaningful keys, it may be a more palatable approach to map the keys to some other meaningless key to overcome the issue.
Another scenario that springs to mind where exposing the keys might be interpreted as a security issue is where inadequate authentication and authorisation is in place for writable data in your application/data layer, allowing someone with knowledge of those keys to interfere with the data in the application. Again, securing the system is the better approach.
Aside from security, I can't think of a specific issue if the keys really do identify the data being interacted with and your application is looking up the keys when it generates the page.
I would be concerned about how the information is processed from the URL. What happens if I post content using value="does_this_break_the_code" or value="can_I_read_secret_info"?
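The concern about tampered values can be handled on the server regardless of whether the option values are primary keys. Below is a minimal Python sketch using sqlite3: the submitted value is checked for shape and then looked up with a parameterized query. The lookup table and the function name resolve_option are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO lookup (id, label) VALUES (?, ?)",
    [(1, "Value 1"), (2, "Value 2")],
)


def resolve_option(submitted_value: str):
    """Return the lookup id for a submitted form value, or None if it is invalid."""
    # Accept only short strings of ASCII digits; everything else is rejected early.
    if not (submitted_value.isascii() and submitted_value.isdigit()):
        return None
    if len(submitted_value) > 9:
        return None
    # Parameterized query: the value is never spliced into the SQL text.
    row = conn.execute(
        "SELECT id FROM lookup WHERE id = ?", (int(submitted_value),)
    ).fetchone()
    return row[0] if row else None


print(resolve_option("1"))
print(resolve_option("does_this_break_the_code"))
```

With this check in place, an unexpected value simply fails to resolve instead of reaching the rest of the application.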
It would be wise to exercise caution in using surrogate keys in URLs or in HTML or application code. I wouldn't say the same thing about keys in general.
A surrogate key is not supposed to have business meaning or to have dependencies in application code or external processes. That's often an important consideration for example if key values need to change as a result of the database design evolving or data sets being merged. By using surrogate keys as "magic numbers" in code or in URLs you could compromise the very thing that makes surrogate keys useful. Also surrogate keys are much less convenient to users (and possibly developers) because the values are meaningless to them and therefore less readable than using a natural key.
I suggest you use natural keys in your URLs and persistent code. Keep surrogate keys internal to the database, which is where they are supposed to be.
Primary keys should be used as a unique identifier for each item in the DB. Chances are the PK isn't a part number or anything else that relates to the actual item. Generally speaking the PK doesn't MEAN anything, and in the world of semantics, everything should mean something. If there is a better unique identifier, by all means use it, because your PK isn't helpful to anything but your database.
Say you have a database of cars. Every car has a unique identifier called a VIN (Vehicle Identification Number), which encodes a bunch of information about that specific car, down to the plant that made it. The VIN identifies only that one specific car. The PK on the item could be anything; if the car gets dropped from the DB, the PK no longer exists, but the VIN is still out there somewhere. It's a much better unique ID than the PK, so that's what should probably be displayed to users.
Which is the best primary key to store website addresses and page URLs?
To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.
This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.
Anyway, I read two books which tell me I am wrong. "High Performance MySQL" says that a random key is not good for the DB optimizer. Moreover, in each of Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural keys for URLs are... the URLs themselves. While a site address is short (www.something.com), there is no imposed limit on the length of a URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
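The suggested layout (numeric auto-increment primary key, UNIQUE hash column, full URL text) can be sketched as follows. This is an approximation in Python with sqlite3 rather than MySQL: hashlib.md5 stands in for MySQL's MD5() function, and INSERT OR IGNORE stands in for INSERT IGNORE. The helper names are made up.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE urls (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- small numeric key for joins
        hash TEXT NOT NULL UNIQUE,               -- 32-char hex digest of the URL
        url  TEXT NOT NULL
    )
    """
)


def url_hash(url: str) -> str:
    """Hex MD5 digest of the URL, used only as a lookup key, not for security."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def add_url(url: str) -> int:
    """Insert the URL if it is new and return its numeric id either way."""
    digest = url_hash(url)
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (digest, url)
        )
    return conn.execute("SELECT id FROM urls WHERE hash = ?", (digest,)).fetchone()[0]


first_id = add_url("http://www.something.com/")
same_id = add_url("http://www.something.com/")          # ignored, same id returned
other_id = add_url("http://www.something.com/page?x=1")
print(first_id, same_id, other_id)
```

Because uniqueness is enforced on the fixed-width hash, the url column itself needs no unique index.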
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one; it will actually work well in MySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens to have a built-in way of doing this with a so-called reverse key index.)
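The effect of reversing can be seen with a quick Python sketch that measures how many leading characters the two example URLs share, before and after reversal. It only illustrates the idea; it does not model how any particular database stores its index.

```python
urls = [
    "http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls",
    "http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb",
]


def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two strings have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Stored as-is, the keys share a long prefix before they first differ.
forward = common_prefix_len(urls[0], urls[1])

# Stored reversed, the keys differ from the first character onward.
backward = common_prefix_len(urls[0][::-1], urls[1][::-1])

print("shared prefix, normal order:", forward)
print("shared prefix, reversed:", backward)
```

The trade-off is that a reversed key no longer supports prefix searches such as URL LIKE 'whatever%'.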
If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
Depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement.
I'm a beginning programmer, building a non-commercial web-site.
I need a user ID, and I thought it would be logical to use a simple INTEGER field with an auto-increment for that. Does that make sense? The user ID will not be directly used by users (they'll have to select a user-name); should I care about where the IDs start (presumably 1)?
Any other best practices I should incorporate in building my 'Users' table?
Thanks!
JDelage
Your design is correct. Your internal PK should be a meaningless number, not seen by users of the system and maintained automatically. It doesn't matter if it starts at 1 and it doesn't matter if it's sequential or not, or if you have "holes" in the sequence. (For cases in which you do expose the number to end users, it is sometimes important that the numbers be neither sequential nor fully-populated so that they are not guessable).
Users should identify themselves to the system with another, meaningful piece of the information (such as an email address). That piece of information should either be guaranteed unique (using a UNIQUE index) or else your front end must provide an interface for disambiguation.
Among the benefits of this design are:
The meaningful identifier for the account can be changed by updating one value in one record of one table, rather than requiring updates all around the database.
Your PK value, which will appear many, many times in the database, is a small and efficiently indexed integer while your user-facing identifier can be of any type you want including a longish text string.
You can use a non-unique identifier with disambiguation if the application calls for it.
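The first two benefits can be sketched with Python's sqlite3 module: a small integer primary key is referenced from another table, and the user-facing identifier (an email address here) is changed with a single UPDATE. The table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- meaningless internal key
        email   TEXT NOT NULL UNIQUE   -- meaningful, user-facing identifier
    )
    """
)
conn.execute(
    """
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        title   TEXT NOT NULL
    )
    """
)

cur = conn.execute("INSERT INTO users (email) VALUES (?)", ("old@example.com",))
user_id = cur.lastrowid
conn.execute(
    "INSERT INTO posts (user_id, title) VALUES (?, ?)", (user_id, "First post")
)

# Changing the user-facing identifier touches one value in one row of one table.
conn.execute(
    "UPDATE users SET email = ? WHERE user_id = ?", ("new@example.com", user_id)
)

owner = conn.execute(
    "SELECT u.email FROM posts p JOIN users u ON u.user_id = p.user_id"
).fetchone()[0]
print(owner)
```

The posts row never stores the email, so it needs no change when the email does.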
auto_increment is okay.
But you shouldn't care about its particular value.
On the contrary, you should never be concerned with the identifier's particular value. Take it as an abstract identifier only.
Though I doubt it can be invisible to users. Do you have another identifier to use? Auto-increment identifiers are usually visible to users as well. For example, your ID here is 98361, and nobody is hiding it. Such numbers are very handy: being unique and unchanging, forever bound to a particular table row, they will always identify the same thing (a user, an article, etc.).
An auto-incrementing field is fine unless you need to do things like share this ID across multiple databases; then you will probably need to create the id value yourself. Also beware of exporting and importing data: if you are not careful, all the id values will get reassigned.
In general I avoid auto incrementing fields so I have more control over how the id values are generated. Which is not to say I care what the values are just that they are unique. These are internal values the end user should never see.
Yes, that is correct. Auto-Increment starts at 1, usually. It's not usually accepted to have 0 as an ID.
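If you are storing passwords, do not store them as clear text. Store a salted hash produced by a slow, purpose-built password hashing function such as bcrypt, scrypt, or PBKDF2; md5 is popular but far too fast to resist brute-force guessing.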
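A minimal sketch of storing a password hash instead of clear text. It assumes PBKDF2-HMAC-SHA256 from Python's hashlib with a random per-user salt, which is a swapped-in technique: a fast, unsalted hash such as plain md5 is easy to brute-force. The iteration count and function names are illustrative assumptions, not a recommendation for any specific system.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest) for the password, generating a salt if none is given."""
    if salt is None:
        salt = os.urandom(16)  # random per-user salt, stored next to the digest
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))
print(verify_password("wrong guess", salt, stored))
```

Only the salt and digest would be stored in the Users table; the clear-text password is never written anywhere.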
Yes, auto-incrementing is fine. You will probably be saving passwords as well; make sure they are protected with a salted password hash (bcrypt, scrypt, or PBKDF2 rather than plain md5).
Also make sure you index the columns you will use to perform lookups, such as email etc... to avoid full table scans.