Best primary key for storing URLs - mysql

which is the best primary key to store website address and page URLs?
To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.
This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.
Anyway I read two books which tell me I am wrong. In "High performance MySQL" it is said that the random key is not good for the DB optimizer. Moreover, in each Joe Celko's books he says the primary key should be some part of the data.
The question is: the natural keys for URLs are... URLs themselves. The fact is that if for a site it is short (www.something.com), there's not an imposed limit for am URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?

You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.

Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one; it will actually work well in mySQL. It has advantages such as simplicity, and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens has a built in way of doing this with a so-called reversed index.)
If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaing.

Depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify an URL in all places in your app, then use the autoincrement

Related

Primary key: a string or number (id)?

I am aware of benefits of using integers (amount of space, performance, indexes) as primary keys as opposite to strings.
Considering situation below...
I have a lookup table called ap_habitat (habitat values are also unique)
id habitat
1 Forest 1
2 Forest 2
Referenced table (fauna)
Especie habitat
X 1
Y 1
Referenced table is not very human readable (I know end users should not care about that, but for me would be useful to directly see in fauna table the NAME of the habitat).
To get a list of fauna and its habitat name I have to do a join...
select fauna.habitat, fauna.especie, AP_h.habitat from fauna INNER JOIN ap_habitat AS AP_h on AP_h.id=1
I could create a view, but if I have to create a view for each table referencing a foreign key...
Just wanna check what more experienced people recommend me.
Databases and, in general, computers are not designed to make your life more simple. They are designed to handle more data than a human mind can ever hope to remember in less time than it takes a human to blink. ;-)
Readability (especially in ideas conceived the before-Apple age) is not an issue at all.
On top of that: If you enjoy strange problems, data mapping impedance and spending endless nights writing workarounds for problems that using real-world names as primary keys get you for free, then be our guest. But please, don't ask for our help. We already know all the problems that you'll run into and it will be very hard for us to restrain our spite.
So: Never, ever use anything but an ID (UUID or long sequence) for a primary key. There are no (good) reasons to do it and if you found one, then you simply don't see the whole picture.
Yes, it makes a couple of things harder (like understanding what your data actually means). But as I said above, computers are meant to solve "lots of data" and "too slow" and nothing else.
Create a view or write a small helper application that can run your most important queries at the click of a button.
That said, I had some success with an application which runs a query and then displays a list of check boxes where I can pull in the foreign key relations to the data that the query returns (i.e. one checkbox per FK).
You ask about number or string as primary key. But based on your example if you use a string it wouldn't be a primary key at all, because you would no longer have a lookup table for it to be the primary key of. Perhaps you would still have the table for reasons not shown, like populating a drop down or storing extended descriptions beyond just the name.
Doing needless joins is not a good thing for performance. And having needless tables might be bad for storage size as well, depending on the length of the strings and the ratio of the sizes of the two tables.
You could also consider enumerated types, in which the data is stored as numbers (more or less) but the database translates them to and from strings automatically.

Should a primary key necessarily be auto-incrementing when I'm sure it is and will always be unique?

I've looked for a satisfying answer a tad more specific to my particular problem for a while now, but to avail. Whether I'm just not looking at the right places or not, I don't know, but here goes:
I'm pulling data from an application that afterwards is manipulated and sent to my own server. Amongst the data pulled is an, originally in the application's database, auto-incremented identifier. An example of this identifier I just now retrieved is 955534861. Isn't it better and more effective design to not auto-increment my primary key and just use the value I know is and will always stay unique, or should I look into concepts such as surrogate keys?
Thanks in advance.
The situation you describe resembles my primary job which is maintaining a data warehouse. We get data from other systems and store it.
Something that happens to us is that these "other systems" change. That leads to possibilities that the new version of the "other system" will duplicate the unique identifier from the previous system. We deal with this by adding something to that record in our data warehouse to guarantee it's uniqueness. It might be a field to identify the source system or it might be a date. It is never an autogenerated number.
If there is any chance of this this happening to you, you might want to expand your options.
If there is a natural key in your model, you cannot replace it by creating a surrogate key.
You can only add a surrogate key and keep the existing natural key, which has its pros and cons, as described here.
This'll get a little nerdy, but bear with me:
As long as a key value is unique, it'll serve its function. But for performance, you ideally want that key value to be as short as possible.
GUIDs are commonly used, because they are statistically highly unlikely to ever be repeated. But that comes at the expense of size: they are 128 bits long, which makes them longer than a machine word. To compare two GUIDs (as must be repeatedly done when sorting, or migrating down a b-tree for indexes) will take multiple processor intructions to load and compare the values. And they will consume more memory when cached into memory.
The advantage of auto-incrementing key values is that
They are guaranteed to be unique. Proxy index values are only predicted to be unique.
Because they will have full value coverage over the range of their underlying datatype, the most compact possible type may be used. This makes for smaller indexes and more efficient compare operations
Because the smallest possible type can be used, more index values can be stored on a single database page, which means you're more likely to get a cache hit when searching or joining on that value. That means that peformance will be--all other things being equal--somewhat better.
On most databases, auto-incrementing keys are worked into the database engine, so there is very small overhead in generating them.
If you employ a clustered index on your key value, new record inserts are less likely to require a random disk seek, and more likely to be read during read-ahead, so if you do any kind of sequential processing or lookup based on that key, it'll probably run faster.
The primary key, typically an auto-incrementing ID, is what MySQL uses as a row identifier as well, so it should be left alone. If you need a secondary key that's generated by your application for some other purpose, you may want to add that as another column with a UNIQUE index on it.
In other databases where there's a proper row identifier mechanism, this is less of an issue.

Primary key mysql using auto increment id or a sha1 hash?

I can either have an auto increment id field as my primary key or a sha1 hash.
Which one should I choose?
Which would be better in terms of performance?
There are a few application-driven cases where you'd want to use a globally unique ID (UUID/GUID):
You expect to (or are) using a sharding strategy to scale writes. You don't want the shard nodes to duplicate keys.
You want to be able to safely port data from one node to another preserving keys. This is critical if you want to keep foreign-key relationships in-tact.
Your application is also used offline (in-home sales, in-home repairs, etc.) where the offline application periodically syncs with the "source of truth". You'd want those offline keys to be unique without having to make a remote call. Otherwise, it is up to you to come up with a strategy to reorganize keys and relationships on the way in. With an auto-increment strategy and depending on the RDBMS you are using, this is likely a non-trivial task.
If you don't have a use-case from above or something similar, you may use an auto-increment id if that makes you comfortable; however, you may still want to consider UUID/GUID
The Trade Off:
There are a lot of opinions held about the speed / size of UUID/GUID keys. At the end of the day, it is a trade-off and there are lots of ways to gain or lose speed with a database. Ideally, you want your indexes to be stored in RAM in order to be as fast as possible; however, that is a trade-off you have to weigh against other considerations.
Other Considerations regarding UUID/GUID:
Many RDBMS can produce a UUID.
You can also produce a UUID via your application (you aren't tied to the RDBMS to generate).
Developers / Testers can easily port data from environment to environment and have the application work as expected. This is an often overlooked use-case; however, it is one of the stronger cases for using a UUID/GUID strategy.
There are databases that are optimized for use offline (CouchDB) where UUID is what you get.
Almost definitely an auto incrementing integer. It will be faster to create, faster to search, and smaller. Consider for example if you had another table that referenced it. Would you want it to reference it via an integral primary key or via a sha1 hash? An integer would be more meaningful (in a way), and it would be much (much!) more efficient.
Use an auto increment id.
An ID does not have to be generted only incremented.
Hashes fit better for storing passwords.
You could get duplicate keys using SHA hashes. The chance is small but real.
An ID is way more readable
An ID is kind of an inserttion history. You know which record was inserted last (highest ID)

Mysql - Should I use ID columns?

I have a doubt about best practices and how the database engine works.
Suppose I create a table called Employee, with the following columns:
SS ID (Primary Key)
Name
Sex
Age
The thing is.. I see a lot of databases that all its tables has and aditional column called ID, wich is a sequencial number. Should I put and ID field in my table here? I mean, it already has a Primary Key to be indexed. Will the database works faster with a sequencial ID field? I dont see how it helps if I wont use it to link or research any table.
Does it helps? If so, why, what happens in the database?
thanks!
EDIT -----
This is just a silly example. Forget about the SS_ID, I know there are better ways for choosing a primary key. The main topi is because some people I know just ask me to add the collumn named ID, even if I know we wont use it for any SQL query. They just think it helps the database's performance in some way, specially because some database tools like Microsoft Access always asks us if we want it to add this new column.
This is wrong, right?
If SS means "Social Security", I'd strongly advise against using that as a PK. An auto-incremented identity is the way to go.
Using keys with business logic built in is a bad idea. Lots of people are sensitive about giving SS information. Your app could be eliminating part of their audience if they use SS as primary key. Laws like HIPPA can make it impossible for you to use.
The actual performance gain in having a sequential id is going to depend a lot on how you use the table.
If you're using some ORM framework, these generally work better having a sequential ID of an integral type [1], which is typically achieved with an sequential id column.
If you don't use an ORM framework, having an idkey that you never use and a surrogate ss_id key which is effectively what you always use makes little sense.
If you're referencing employees from other database table (foreign-key), then it'll probably be more efficient to have an id column, as storing that integer is going to consume less space in the child tables than storing the ss_id (which I assume is a CHAR or VARCHAR) everywhere.
On the ss_id, assuming it's a social security number (looks like it would be), there might be legal & privacy concerns attached to it that you should care about - my answer assumes you do have valid reasons to have social security numbers in your database, and that you would be legally allowed to use & store them.
[1] This is usually explained by the fact the ORM frameworks rely on having highly specialized cache mechanisms, that are tailored for typical ORM use - which usually implies having a sequential id primary key, and letting application deal with actual business identity. This is in fact related to consideration very similar to these of the "foreign key" considerations.
US Social Security numbers are not sufficiently identifying. And banks certainly do not use them in that way. Not everybody has one. Errors result in duplicates. Foreigners don't have them. They are far too fragile to use as a database PK.
Most importantly: the are resused after death
Do some research: SSN as Primary Key
What's more important (obviously) is that you have a primary key, as long as the data you put use for that primary key will be uniquely identifiable. In your example, SSN's are uniquely identifiable which is why banks use them and will work. The problem with this example is that your Employee ID is likely to be used as a Foreign Key in other tables, which means you're taking personal information (that is legally protected) and spraying it across your data model. You might do better using an Auto Incremented field in this case.

Strings as Primary Keys in MYSQL Database [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
This post was edited and submitted for review 10 days ago.
Improve this question
I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?
For Example I have a database that would have about 100 million row like mobile number, name and email. mobile number and email would be unique. so can I have the mobile number or email as a primary key,
well it effect my query performance when I search based on email or mobile number. similarly the primary key well be used as foreign key in 5 to 6 tables or even more.
I am using MySQL database
Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.
Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DOES NOT cause the index to be rewritten. It does not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise a tables data that has the clustered index. At most, you will end up writing 2 pages of database.
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don;t get duplicate records, and you don't have to update a million child records because a company name changed.
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provide you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whther to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.
Two reasons to use integers for PK columns:
We can set identity for integer field which incremented automatically.
When we create PKs, the db creates an index (Cluster or Non Cluster) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on a any key and are doing large numbers of inserts that are non-sequential in the index. Every line written will cause the index to re-write. if you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
From performance standpoint - Yes string(PK) will slow down the performance when compared to performance achieved using an integer(PK), where PK ---> Primary Key.
From requirement standpoint - Although this is not a part of your question still I would like to mention. When we are handling huge data across different tables we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables and mostly each or some table would be related to the other through some relation ( a concept of Foreign Key ). Therefore we really cannot always choose an integer as a Primary Key, rather we go for a combination of 3, 4 or 5 attributes as the primary key for that tables. And those keys can be used as a foreign key when we would relate the records with some other table. This makes it useful to relate the records across different tables when required.
Therefore for Optimal Usage - We always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
sample_pk INT NOT NULL AUTO_INCREMENT,
sample_id VARCHAR(100) NOT NULL,
...
PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There could be a very big misunderstanding related to string in the database are. Almost everyone has thought that database representation of numbers are more compact than for strings. They think that in db-s numbers are represented as in the memory. BUT it is not true. In most cases number representation is more close to A string like representation as to other.
The speed of using number or string is more dependent on the indexing then the type itself.
By default ASPNetUserIds are 128 char strings and performance is just fine.
If the key HAS to be unique in the table it should be the Key. Here's why;
primary string key = Correct DB relationships, 1 string key(The primary), and 1 string Index(The Primary).
The other option is a typical int Key, but if the string HAS to be unique you'll still probably need to add an index because of non-stop queries to validate or check that its unique.
So using an int identity key = Incorrect DB Relationships, 1 int key(Primary), 1 int index(Primary), Probably a unique string Index, and manually having to validate the same string doesn't exist(something like a sql check maybe).
To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.