I am creating a RESTful API.
My table of e.g. users has a primary key 1,2,3, ...
To name my resources in the API, I want something more complex: a hash of something, which will also be a unique identifier but a little more difficult to guess.
Should I save this hash in an extra column in my user table, or kick the 1,2,3, ... out of the primary key and use the unique hash as the global id (database & API)?
Why the complexity?
REST API URLs are meant to be discoverable. Obfuscating the resource identifier is anything but discoverable. If you want to keep people from accessing certain data, then secure that data through authentication and authorization. If you're really creating a RESTful API, part of that is discoverability.
So, from the API perspective, the only sane reason I can imagine for doing something like that is avoiding a strong coupling between the URIs and the PK. For instance, you might expect to change storage in the future and not want to be stuck with a sequential PK forever. If that's the case, I'd say use a random UUID (version 4), store it as a binary value in the database, and use the hex representation to construct the URI. That's what I did in this situation and it works fine.
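A minimal sketch of that UUID approach, using only Python's standard `uuid` module. The `/users/` path and the helper names are hypothetical and only illustrate the binary-in-database, hex-in-URI split:

```python
import uuid


def new_resource_id() -> uuid.UUID:
    """Generate a random (version 4) UUID."""
    return uuid.uuid4()


def to_db_value(resource_id: uuid.UUID) -> bytes:
    """Return the 16 raw bytes, e.g. for a BINARY(16) column."""
    return resource_id.bytes


def to_uri(resource_id: uuid.UUID) -> str:
    """Build a URI from the 32-character hex form (hypothetical /users/ path)."""
    return f"/users/{resource_id.hex}"


def from_uri_segment(segment: str) -> bytes:
    """Turn the hex segment of a URI back into the stored 16 bytes."""
    return uuid.UUID(hex=segment).bytes


rid = new_resource_id()
uri = to_uri(rid)
stored = to_db_value(rid)
```

Parsing the segment with `uuid.UUID(hex=...)` also rejects malformed input with a `ValueError` before it gets anywhere near a query.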
Now, from the database perspective, I would recommend checking how your database deals with random values as primary key before adopting that. For instance, MySQL insert performance degrades terribly with random values in the clustered index, and it's better to have a unique index for the hash/uuid column, and an auto-increment column as PK.
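The layout described here (sequential PK, plus a uniquely indexed random column for the API) could look roughly like this. It is a sketch that uses SQLite from Python's standard library as a stand-in for MySQL; the table, column, and function names are assumptions:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal, sequential key
        public_id BLOB NOT NULL UNIQUE,               -- 16-byte UUID exposed by the API
        name      TEXT NOT NULL
    )
    """
)


def create_user(name: str) -> str:
    """Insert a user and return the hex identifier to hand to API clients."""
    public_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO users (public_id, name) VALUES (?, ?)",
        (public_id.bytes, name),
    )
    return public_id.hex


def find_user(public_hex: str):
    """Look a user up by the public identifier; returns (id, name) or None."""
    key = uuid.UUID(hex=public_hex).bytes
    return conn.execute(
        "SELECT id, name FROM users WHERE public_id = ?", (key,)
    ).fetchone()


token = create_user("alice")
```

Foreign keys in other tables can keep pointing at the small integer `id`; only the API layer ever sees `public_id`.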
Other than that, if all you want is to obfuscate the URI, I wouldn't change the database, and simply apply some reversible encoding to the integer value, to use it as part of the URI.
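One possible reversible encoding, sketched under the assumption that the keys fit in 32 bits. The constants are arbitrary, hypothetical choices, and this is obfuscation only; it does not replace authorization checks:

```python
MASK = 0xFFFFFFFF                      # work in unsigned 32-bit space
MULTIPLIER = 0x9E3779B1                # any odd number is invertible modulo 2**32
INVERSE = pow(MULTIPLIER, -1, 2**32)   # modular inverse (Python 3.8+)
XOR_KEY = 0x5BF03635                   # arbitrary constant, hypothetical


def encode_id(pk: int) -> str:
    """Scramble an integer key into an 8-character hex token for the URI."""
    if not 0 <= pk <= MASK:
        raise ValueError("key out of range for 32-bit encoding")
    scrambled = ((pk * MULTIPLIER) & MASK) ^ XOR_KEY
    return format(scrambled, "08x")


def decode_id(token: str) -> int:
    """Recover the integer key from a token produced by encode_id."""
    scrambled = (int(token, 16) ^ XOR_KEY) & MASK
    return (scrambled * INVERSE) & MASK
```

Consecutive keys such as 1, 2, 3 map to tokens that do not look sequential, and the database schema stays untouched.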
In my tables I use an auto-increment PK on tables where I store for example posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what the best (most common) type for this field is.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key instead, for performance.
What is your take on the best approach for this situation?
MySQL has a function, UUID(), that generates a 128-bit version 1 UUID as described in RFC 4122 and returns it as a hex string with dashes, following the usual UUID formatting convention.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits are seldom changing, and the lower bits are simply monotonically incrementing, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
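To make the "upper bits seldom change, lower bits increment" point concrete, here is a rough Python imitation of the construction as I understand it from the manual (server id in the top byte, server startup time in the middle, a counter at the bottom). The function name and inputs are hypothetical, and the manual remains the authority on the exact layout:

```python
import itertools


def make_short_id_generator(server_id: int, startup_seconds: int):
    """Return a function producing increasing 64-bit ids for one server run."""
    # Top 8 bits: server id. Next bits: startup time. These stay fixed
    # for as long as the (simulated) server keeps running.
    base = ((server_id & 255) << 56) + (startup_seconds << 24)
    counter = itertools.count()

    def next_id() -> int:
        # Low bits: a simple incrementing counter.
        return base + next(counter)

    return next_id


next_id = make_short_id_generator(server_id=1, startup_seconds=1_700_000_000)
first, second = next_id(), next_id()
```

Because successive values differ only in the low bits, new rows land at the end of the index rather than at random positions.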
I can either have an auto increment id field as my primary key or a sha1 hash.
Which one should I choose?
Which would be better in terms of performance?
There are a few application-driven cases where you'd want to use a globally unique ID (UUID/GUID):
You expect to use (or are already using) a sharding strategy to scale writes. You don't want the shard nodes to duplicate keys.
You want to be able to safely port data from one node to another while preserving keys. This is critical if you want to keep foreign-key relationships intact.
Your application is also used offline (in-home sales, in-home repairs, etc.) where the offline application periodically syncs with the "source of truth". You'd want those offline keys to be unique without having to make a remote call. Otherwise, it is up to you to come up with a strategy to reorganize keys and relationships on the way in. With an auto-increment strategy and depending on the RDBMS you are using, this is likely a non-trivial task.
If you don't have a use-case from above or something similar, you may use an auto-increment id if that makes you comfortable; however, you may still want to consider UUID/GUID.
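A small sketch of the offline case, with plain dictionaries standing in for tables. All names are hypothetical; the point is that keys generated independently can be merged without renumbering, so the child rows keep pointing at the right parents:

```python
import uuid


def empty_store() -> dict:
    """A stand-in for a database with two related tables."""
    return {"orders": {}, "lines": {}}


def record_sale(store: dict, customer: str, items: list) -> str:
    """Create an order and its lines using locally generated UUID keys."""
    order_id = str(uuid.uuid4())
    store["orders"][order_id] = {"customer": customer}
    for item in items:
        store["lines"][str(uuid.uuid4())] = {"order_id": order_id, "item": item}
    return order_id


def merge(central: dict, offline: dict) -> None:
    """Copy offline rows into the central store, keeping their keys as-is."""
    for table in ("orders", "lines"):
        overlap = central[table].keys() & offline[table].keys()
        if overlap:
            raise ValueError(f"duplicate keys in {table}: {overlap}")
        central[table].update(offline[table])


central, laptop_a, laptop_b = empty_store(), empty_store(), empty_store()
record_sale(laptop_a, "customer A", ["sofa", "lamp"])
record_sale(laptop_b, "customer B", ["table"])
merge(central, laptop_a)
merge(central, laptop_b)
```

With auto-increment keys, both laptops would have produced order 1, and the merge step would need to rewrite keys and every reference to them.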
The Trade Off:
There are a lot of opinions held about the speed / size of UUID/GUID keys. At the end of the day, it is a trade-off and there are lots of ways to gain or lose speed with a database. Ideally, you want your indexes to be stored in RAM in order to be as fast as possible; however, that is a trade-off you have to weigh against other considerations.
Other Considerations regarding UUID/GUID:
Many RDBMS can produce a UUID.
You can also produce a UUID via your application (you aren't tied to the RDBMS to generate).
Developers / Testers can easily port data from environment to environment and have the application work as expected. This is an often overlooked use-case; however, it is one of the stronger cases for using a UUID/GUID strategy.
There are databases that are optimized for use offline (CouchDB) where UUID is what you get.
Almost definitely an auto incrementing integer. It will be faster to create, faster to search, and smaller. Consider for example if you had another table that referenced it. Would you want it to reference it via an integral primary key or via a sha1 hash? An integer would be more meaningful (in a way), and it would be much (much!) more efficient.
Use an auto increment id.
An ID does not have to be generated, only incremented.
Hashes fit better for storing passwords.
You could get duplicate keys using SHA hashes. The chance is small but real.
An ID is way more readable.
An ID is a kind of insertion history. You know which record was inserted last (highest ID).
Somewhere I read that values in select boxes (or anything else in the HTML code) should not be the primary key of the database table. For example:
<select>
<option value="1">Value 1</option>
<option value="2">Value 2</option>
</select>
In the database there are lookup tables with these values as primary key (1, 2, 3,....). So the data from the select box I store in a table which references this lookup table is a number like 1, 2, 3.... (as the value of the options fields).
I read that it is better not to use the same values in the HTML and as keys, for security reasons, but what's the matter with that? I don't understand why this should be a security issue.
Sounds like security-through-obscurity, aka no security at all to me.
A good primary key in a database is purely for uniqueness in the system and shouldn't be related to the meaning of the data. If the primary key was related to the data (say people's social security numbers, stuff like that) then you've got a security issue in exposing the keys, as they are exposing information that could be used maliciously. In that case, whilst you could argue that the best approach from a technical point of view might be to change the application to stop it using those meaningful keys, it may be a more palatable approach to map the keys to some other meaningless key to overcome the issue.
Another scenario that springs to mind where exposing the keys might be interpreted as a security issue is where inadequate authentication and authorisation is in place for writable data in your application/data layer, allowing someone with knowledge of those keys to interfere with the data in the application. Again, securing the system is the better approach.
Aside from security, I can't think of a specific issue if the keys really do identify the data being interacted with and your application is looking up the keys when it generates the page.
I would be concerned about how the information is processed from the URL. What happens if I post content using value="does_this_break_the_code" or value="can_I_read_secret_info"?
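Whatever key is exposed, the submitted value should be treated as untrusted input. A minimal sketch of server-side validation, where `ALLOWED_OPTIONS` is a hypothetical stand-in for the lookup table:

```python
# Stand-in for the lookup table: primary key -> label.
ALLOWED_OPTIONS = {1: "Value 1", 2: "Value 2"}


def parse_option(raw: str) -> int:
    """Return the selected key, or raise ValueError for anything unexpected."""
    try:
        key = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"not a valid option: {raw!r}") from None
    if key not in ALLOWED_OPTIONS:
        raise ValueError(f"unknown option: {key}")
    return key
```

A value such as `does_this_break_the_code` is rejected before it reaches any query, and a numeric but unknown key is rejected by the membership check.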
It would be wise to exercise caution in using surrogate keys in URLs or in HTML or application code. I wouldn't say the same thing about keys in general.
A surrogate key is not supposed to have business meaning or to have dependencies in application code or external processes. That's often an important consideration for example if key values need to change as a result of the database design evolving or data sets being merged. By using surrogate keys as "magic numbers" in code or in URLs you could compromise the very thing that makes surrogate keys useful. Also surrogate keys are much less convenient to users (and possibly developers) because the values are meaningless to them and therefore less readable than using a natural key.
I suggest you use natural keys in your URLs and persistent code. Keep surrogate keys internal to the database, which is where they are supposed to be.
Primary keys should be used as a unique identifier for each item in the DB; chances are the PK isn't a part number or anything that relates to the actual item. Generally speaking the PK doesn't MEAN anything, and in the world of semantics, everything should mean something. If there is a better unique identifier, by all means use it, because your PK isn't helpful to anything but your database.
Say you have a database of cars. All cars have a unique identifier called a VIN (Vehicle Identification Number), and encoded in the VIN is a bunch of info about each specific car, down to the plant that made it. The VIN identifies only that one specific car. The PK on the item could be anything: if the car gets dropped from the DB, the PK no longer exists, but that VIN is still out there somewhere. It's a much better unique ID than the PK, so that's what should probably be displayed to the users.
Which is the best primary key for storing website addresses and page URLs?
To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.
This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.
Anyway, I read two books which tell me I am wrong. In "High Performance MySQL" it is said that a random key is not good for the DB optimizer. Moreover, in each of Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural keys for URLs are... the URLs themselves. The fact is that while for some sites the URL is short (www.something.com), there is no imposed limit on the length of a URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
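The suggested starting schema can be sketched as follows. SQLite (from Python's standard library) stands in for MySQL here, so `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`, and the MD5 is computed in the application because SQLite has no built-in `MD5()`. Function names are hypothetical:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE urls (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- small numeric key for joins
        hash TEXT NOT NULL UNIQUE,               -- 32 hex chars, MD5 of the URL
        url  TEXT NOT NULL                       -- full URL text, not indexed
    )
    """
)


def url_hash(url: str) -> str:
    """MD5 of the URL as a 32-character hex string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def add_url(url: str) -> int:
    """Insert the URL if it is new and return its numeric id either way."""
    digest = url_hash(url)
    conn.execute(
        "INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (digest, url)
    )
    row = conn.execute("SELECT id FROM urls WHERE hash = ?", (digest,)).fetchone()
    return row[0]


first = add_url("http://www.something.com/some/long/path?x=1")
again = add_url("http://www.something.com/some/long/path?x=1")
```

The UNIQUE constraint on the fixed-width hash does the deduplication, so the arbitrarily long URL column never needs its own unique index.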
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one, using the URL itself as the key; it will actually work well in MySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens to have a built-in way of doing this with a so-called reverse key index.)
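The effect of reversing can be seen with the two example URLs above. This is only an illustration of where the differing characters end up, not a benchmark; `reversed_key` is a hypothetical helper:

```python
from os.path import commonprefix

urls = [
    "http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls",
    "http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb",
]


def reversed_key(url: str) -> str:
    """Value to store and index instead of the URL; reversing twice restores it."""
    return url[::-1]


# commonprefix compares character by character, so it works on any strings.
shared_forward = commonprefix(urls)
shared_reversed = commonprefix([reversed_key(u) for u in urls])
```

Stored forward, the two keys share a long run of leading characters before they differ; stored reversed, they differ at the very first character. An exact-match lookup then reverses the search value the same way before querying.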
If I were you, I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
It depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement.
Greetings,
I have some MySQL tables that are currently using an md5 hash as a primary key. I normally generate the hash from the value of a column. For instance, let's imagine I have a table called "Artists" with the fields id, name, num_members, year. I tend to compute md5($name) and use it as the ID.
I would like to know what the downsides of doing this are. Is it just better to use integers with AUTO_INCREMENT? I tend to run away from this because it's just not worth the trouble of finding out what the last inserted id was, what the next one will be, etc.
Can you shed some light on this?
Thank you.
If you need a surrogate primary key, using an AUTO_INCREMENT field is better than an md5 hash, because it is fewer bytes of data, and database backends optimize for integer primary keys.
mysql_insert_id can be used if you need the last inserted id.
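Most database drivers expose the same idea. As an illustration only, here is how it looks with Python's built-in `sqlite3` module, where the cursor's `lastrowid` attribute plays that role; the tables and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artists (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE albums (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        artist_id INTEGER NOT NULL REFERENCES artists(id),
        title     TEXT NOT NULL
    )
    """
)

# The cursor returned by the INSERT reports the id the database assigned.
cursor = conn.execute("INSERT INTO artists (name) VALUES (?)", ("Example Artist",))
artist_id = cursor.lastrowid

# Use that id right away for the dependent row; no need to guess the next id.
conn.execute(
    "INSERT INTO albums (artist_id, title) VALUES (?, ?)",
    (artist_id, "Example Album"),
)
```

The application never has to predict the next id; it only reads back the one it was just given.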
If you are generating the primary key as a hash of other columns, why not just use those other columns as a unique key, then join on those?
Another question is, what are the upsides of using an md5 hash? I can't think of any.
The MD5 isn't a true key in this case because it functionally depends on the name. That means that if you have two artists with the same name, you have duplicate "keys" for different records. You could make it a real key by hashing all the attributes together (and hoping that the probability gods don't send you a collision), or you could just save yourself the trouble and use an autoincrementing ID.
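A quick way to see the functional dependency, using Python's `hashlib`; the artist data is made up for illustration:

```python
import hashlib


def artist_key(name: str) -> str:
    """The md5(name) scheme from the question."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


# Two different (hypothetical) artists that happen to share a name.
artists = [
    {"name": "The Example Band", "num_members": 4, "year": 1971},
    {"name": "The Example Band", "num_members": 2, "year": 2009},
]

# A dict keyed by the hash behaves like a primary key: the second
# record silently replaces the first because the keys are identical.
table = {}
for artist in artists:
    table[artist_key(artist["name"])] = artist
```

In a real table with a primary key constraint, the second insert would fail instead of overwriting, but the underlying problem is the same.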
It seems like the way you're trying to use the MD5 isn't really buying you any benefit. If "$name" is unique, then why not just use "name" as the primary key? Calculating an MD5 hash and using it as a key for something that's already unique is redundant.
On the other hand, if "name" is not unique, then the MD5 hash won't be unique either and so it's pointless that way too.
Generally you use a hash when you don't want to store the actual value of the column. For instance, if you're storing passwords, you generally store only a hash of the password, not the password itself, so that you can't see people's passwords just by looking at the table contents (for real passwords, a salted, purpose-built password hash is a safer choice than plain MD5).
If you don't have any unique fields, then you're stuck doing something like an auto-increment because it's at least guaranteed unique. If you use the built-in SQL auto-increment, then you'll just have to fetch the last one way or another. Alternately, if you can get away with keeping a unique counter locally in your application, that avoids having to use auto-increment, but isn't necessarily viable for most applications.
The first approach has one obvious disadvantage: if there are two artists of the same name there will be a primary key collision. Using an INT column with an auto-increment will ensure uniqueness.
Furthermore, though very unlikely, there is a chance that MD5 hashes of different strings could collide (an MD5 hash is 128 bits, so for two given strings the probability is about 1 in 16 to the power of 32, i.e. 1 in 2^128).
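For a sense of scale, here is a back-of-the-envelope calculation. It assumes MD5 outputs behave like uniformly random 128-bit values, which ignores deliberately crafted collisions, and it uses the standard birthday approximation:

```python
from fractions import Fraction

MD5_BITS = 128
SPACE = 2 ** MD5_BITS  # number of possible MD5 values, equal to 16**32


def pair_collision_probability() -> Fraction:
    """Chance that two specific, different inputs share a hash."""
    return Fraction(1, SPACE)


def birthday_bound(n: int) -> Fraction:
    """Upper bound on the chance of any collision among n random hashes."""
    return Fraction(n * (n - 1), 2 * SPACE)


# Even a billion rows leave the accidental collision chance negligible.
chance_for_a_billion_rows = float(birthday_bound(10**9))
```

So accidental collisions are not the practical objection to hash keys; size, index locality, and duplicate source values matter far more.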
The benefits are if you present the IDs to customers (say in a query string for a web form, though that is another no-no)... it prevents users guessing another one.
Personally I use auto-increment without problems (have moved DBs to new servers and everything without problems)