What should I store as my index in client code? - mysql

I'm using Google App Engine and storing users. Before, I was using MySQL with an auto-incrementing int for my userId field, but now GAE automatically generates a key for each new entity I store, such as g5kZXZ-aGVsbG93b3JsZHIKCxIEVXNlchgNDA, and it also automatically generates an integer ID for the entity.
Which one should I use in my client code to store as the userId? Are there any advantages to using the long key that GAE generates, or is using the small int ID the same thing in terms of performance and look ups? Are there any advantages to one over the other, or is there practically no difference?
Edit: Sorry my question was not clear enough. Here's what I'm asking:
It's not about length, but does having the lookup key put me a step ahead of not having it? Because if I wanted to look up a user, I'd have to look him up by email, but now that I have the key for that row in the "table", does this give me any sort of advantage?

Either one is fine. There's no performance difference between using a long string of letters and using a short one to identify users.

Remember that the generated entity ID is not guaranteed to be a monotonically increasing value.

Related

Unique keys in distributed RDBMS

Imagine there's a Relational Database System (let's say MySQL) that is clustered in many servers (maybe 100 servers). In this Database System there is a table called "users", and "users" contains a primary key (UINT for instance).
This user ID must be unique among all the servers. This user ID may be auto incrementing.
So how does a distributed database system handle this type of problem? How does an RDBMS generate an ID that is unique among all the servers?
I don't want any SQL code of how to do so in MySQL, I just need to know how it is done in such a case.
[Edit]
Both answers sound OK.
Here is another case. Take StackOverflow as an example. This question's URL is http://stackoverflow.com/questions/18359434. Another URL is http://stackoverflow.com/questions/18359435, which points to the question that was asked after this one. Obviously StackOverflow has multiple database servers, but the IDs for questions are auto-incrementing.
So what approach is StackOverflow using?
StackOverflow gets a huge amount of traffic; it ranks around 100 on both Alexa and Quantcast.
The canonical solution is to use uuid() rather than an integer for such a unique identifier. A UUID is designed to be unique in space as well as time.
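As a minimal sketch of the UUID idea, here is how any server could mint an identifier without talking to the others. It uses Python's standard uuid module rather than MySQL's uuid() function, and the helper name new_user_id is hypothetical.

```python
import uuid

def new_user_id() -> str:
    """Return a random (version 4) UUID in its 36-character text form."""
    return str(uuid.uuid4())

# Two servers can call this independently. A collision is extremely
# unlikely, though not strictly impossible.
a = new_user_id()
b = new_user_id()
print(a, b, a != b)
```

No coordination is needed, which is the whole appeal; the price is a larger, non-sequential key.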
A more "hacked" solution is to use two-part primary keys. Have the first be an identifier of "what system am I on" and the second be an auto-incremented number, unique to that system.
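A rough sketch of the two-part key, assuming each server is configured with its own small integer identifier. The name two_part_keys and the in-memory counter are illustrative only; in a real database the second part would be the per-server auto-increment column.

```python
import itertools
from typing import Iterator, Tuple

def two_part_keys(system_id: int) -> Iterator[Tuple[int, int]]:
    """Yield (system_id, local_counter) pairs for one server."""
    for local in itertools.count(1):
        yield (system_id, local)

server_a = two_part_keys(system_id=1)
server_b = two_part_keys(system_id=2)

# Both servers may reuse the same local counter values; the pair as a
# whole is still unique because the first part differs.
keys = [next(server_a), next(server_a), next(server_b)]
print(keys)  # [(1, 1), (1, 2), (2, 1)]
```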
Another "hacked" solution is to give each system ranges. Say you are using big integers, then 1,000,000,000 might start the value on one system, 2,000,000,000 on another, and so on.
I would not recommend that you actually try to implement an auto-incremented number across a distributed system. This would basically entail having a single system that maintained the most recent number, and having the other systems ask it for the next number. However you implement this, you will introduce a bottleneck into the system.
In this case I'd use a GUID primary key and I wouldn't have this issue (I'm not sure MySQL supports this, though).
The alternative, old-fashioned way is to use primary key ranges: have one instance use keys from 1.000.000 to 1.999.999, the next use 2.000.000 to 2.999.999, and so on, thus ensuring that no instance can use the keys of another.
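The range scheme can be sketched as a small allocator. This is only an illustration under assumed names (RangeAllocator, range_size); a real deployment would persist the counter and decide what happens when a range runs out.

```python
class RangeAllocator:
    """Hands out IDs from a fixed range reserved for one instance."""

    def __init__(self, instance: int, range_size: int = 1_000_000):
        self.next_id = instance * range_size
        self.limit = (instance + 1) * range_size  # exclusive upper bound

    def allocate(self) -> int:
        if self.next_id >= self.limit:
            raise RuntimeError("range exhausted; a new range must be assigned")
        value = self.next_id
        self.next_id += 1
        return value

first = RangeAllocator(instance=1)   # owns 1,000,000 .. 1,999,999
second = RangeAllocator(instance=2)  # owns 2,000,000 .. 2,999,999
print(first.allocate(), first.allocate(), second.allocate())
```

The exhaustion check matters: silently running into the neighbouring range is exactly the collision the scheme is meant to prevent.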

Best approach for having unique row IDs in the whole database rather than just in one table?

I'm designing a database for a project of mine, and in the project I have many different kinds of objects.
Every object might have comments on it - which it pulls from the same comments table.
I noticed I might run into problems when two different kinds of objects have the same id: when pulling from the comments table, they will pull each other's comments.
I could just solve it by adding an object_type column, but it will be harder to maintain when querying, etc.
What is the best approach to have unique row IDs across my whole database?
I noticed that Facebook numbers its objects with really, really large IDs, and probably determines the type of an object by id mod a trillion or some other really big number.
Though that might work, are there any other options to achieve the same thing, or should relying on big enough number ranges be fine?
Thanks!
You could use something like what Twitter uses for their unique IDs.
http://engineering.twitter.com/2010/06/announcing-snowflake.html
For every object you create, you will have to make some sort of API call to this service, though.
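For a feel of what such a service does internally, here is an in-process sketch of a Snowflake-style 64-bit ID, following the publicly described layout of a millisecond timestamp, a 10-bit worker number and a 12-bit sequence. The epoch constant and the class name are assumptions made for the example, and this is not Twitter's actual code.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds

class SnowflakeLikeGenerator:
    def __init__(self, worker_id: int):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                now = self.last_ms  # never step backwards in time
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # 4096 IDs already issued for this millisecond:
                    # borrow the next millisecond instead of waiting.
                    now = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeLikeGenerator(worker_id=7)
print(gen.next_id(), gen.next_id())
```

IDs from one worker come out strictly increasing, and different workers cannot collide as long as their worker numbers differ.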
Why not tweak your concept of object_type by integrating it into the id column? For example, an ID would be a concatenation of the object type, a separator and an ID that is unique within that object type.
This approach might scale better, as a unique ID generator for the whole database might lead to a performance bottleneck.
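A minimal sketch of such a concatenated ID, assuming a colon as the separator and string type names; both choices, and the function names, are hypothetical.

```python
from typing import Tuple

SEPARATOR = ":"  # hypothetical; must never appear inside a type name

def make_global_id(object_type: str, local_id: int) -> str:
    """Combine an object type and a per-type ID into one identifier."""
    if SEPARATOR in object_type:
        raise ValueError("object type must not contain the separator")
    return f"{object_type}{SEPARATOR}{local_id}"

def parse_global_id(global_id: str) -> Tuple[str, int]:
    """Split a combined identifier back into (object_type, local_id)."""
    object_type, _, local = global_id.partition(SEPARATOR)
    return object_type, int(local)

photo = make_global_id("photo", 17)
post = make_global_id("post", 17)
print(photo, post, parse_global_id(photo))
```

Both objects keep the local ID 17, yet comments attached to "photo:17" and "post:17" can no longer be confused.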
If you only have one database instance, you can create a new table to allocate IDs:
CREATE TABLE id_gen (
id BIGINT PRIMARY KEY AUTO_INCREMENT NOT NULL
);
Now you can easily generate new unique IDs and use them to store your rows:
INSERT INTO id_gen () VALUES ();
INSERT INTO foo (id, x) VALUES (LAST_INSERT_ID(), 42);
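The same pattern can be tried end to end with a small script. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL, so the syntax differs slightly (DEFAULT VALUES and lastrowid instead of an empty VALUES list and LAST_INSERT_ID()); the bar table and the next_id helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE id_gen (id INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE foo (id INTEGER PRIMARY KEY, x INTEGER);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, y TEXT);
""")

def next_id(conn: sqlite3.Connection) -> int:
    """Insert an empty row into id_gen and return its generated id."""
    cur = conn.execute("INSERT INTO id_gen DEFAULT VALUES")
    return cur.lastrowid

# Rows in different tables draw their IDs from the same sequence.
foo_id = next_id(conn)
conn.execute("INSERT INTO foo (id, x) VALUES (?, ?)", (foo_id, 42))
bar_id = next_id(conn)
conn.execute("INSERT INTO bar (id, y) VALUES (?, ?)", (bar_id, "hello"))
conn.commit()
print(foo_id, bar_id)  # 1 2
```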
Of course, the moment you have to shard this, you're in a bit of trouble. You could set aside a single database instance that manages this table, but then you have a single point of failure for all writes and a significant I/O bottleneck (that only grows worse if you ever have to deal with geographically disparate datacenters).
Instagram has a wonderful blog post on their ID generation scheme, which leverages PostgreSQL's awesomeness and some knowledge about their particular application to generate unique IDs across shards.
Another approach is to use UUIDs, which are extremely unlikely to exhibit collisions. You get global uniqueness for "free", with some tradeoffs:
slightly larger size: a BIGINT is 8 bytes, while a UUID is 16 bytes;
indexing pains: INSERT is slower for unsorted keys. (Time-based, version 1 UUIDs are somewhat preferable to hashes here, as they contain a timestamp segment; random, version 4 UUIDs do not.)
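The size tradeoff is easy to check. The sketch below only measures raw encodings in Python; how a given database stores each type on disk may differ.

```python
import struct
import uuid

bigint_bytes = struct.pack("<q", 123456789)  # signed 64-bit integer
random_uuid = uuid.uuid4()

print(len(bigint_bytes))       # 8
print(len(random_uuid.bytes))  # 16
print(len(str(random_uuid)))   # 36 if the UUID is stored as text
```

Storing the UUID as text rather than as 16 raw bytes more than doubles the key size again.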
Yet another approach (which was mentioned previously) is to use a scalable ID generation service such as Snowflake. (Of course, this involves installing, integrating, and maintaining said service; the feasibility of doing that is highly project-specific.)
I am using tables as object classes, rows as objects and columns as object parameters. Everything starts with the class techname, in which every object has its unique identifier, which is unique in the database. The object classes are registered as objects in the table object classes, and the parameters for each object class are linked to it.

Best primary key for storing URLs

Which is the best primary key for storing website addresses and page URLs?
To avoid the use of autoincremental id (which is not really tied to the data), I designed the schema with the use of a SHA1 signature of the URL as primary key.
This approach is useful in many ways: for example I don't need to read the last_id from the database so I can prepare all table updates calculating the key and do the real update in a single transaction. No constraint violation.
Still, I have read two books that tell me I am wrong. "High Performance MySQL" says that a random key is not good for the DB optimizer. Moreover, in each of his books Joe Celko says that the primary key should be some part of the data.
The question is: the natural keys for URLs are... the URLs themselves. The trouble is that while a site URL may be short (www.something.com), there is no imposed limit on the length of a URL (see http://www.boutell.com/newfaq/misc/urllength.html).
Consider I have to store (and work with) some millions of them.
Which is the best key, then? Autoincremental ids, URLs, hashes of URLs?
You'll want an autoincrement numeric primary key. For the times when you need to pass ids around or join against other tables (for example, optional attributes for a URL), you'll want something small and numeric.
As for what other columns and indexes you want, it depends, as always, on how you're going to use them.
A column storing a hash of each URL is an excellent idea for almost any application that uses a significant number of URLs. It makes SELECTing a URL by its full text about as fast as it's going to get. A second advantage is that if you make that column UNIQUE, you don't need to worry about making the column storing the actual URL unique, and you can use REPLACE INTO and INSERT IGNORE as simple, fast atomic write operations.
I would add that using MySQL's built-in MD5() function is just fine for this purpose. Its only disadvantage is that a dedicated attacker can force collisions, which I'm quite sure you don't care about. Using the built-in function makes, for example, some types of joins much easier. It can be a tiny bit slower to pass a full URL across the wire ("SELECT url FROM urls WHERE hash=MD5('verylongurl')" instead of "WHERE hash='32charhexstring'"), but you'll have the option to do that if you want. Unless you can come up with a concrete scenario where MD5() will let you down, feel free to use it.
The hard question is whether and how you're going to need to look up URLs in ways other than their full text: for example, will you want to find all URLs starting with "/foo" on any "bar.com" host? While "LIKE '%bar.com%/foo%'" will work in testing, it will fail miserably at scale. If your needs include things like that, you can come up with creative ways to generate non-UNIQUE indexes targeted at the type of data you need... maybe a domain_name column, for starters. You'll have to populate those columns from your application, almost certainly (triggers and stored procedures are a lot more trouble than they're worth here, especially if you're concerned about performance -- don't bother).
The good news is that relational databases are very flexible for that sort of thing. You can always add new columns and populate them later. I would suggest for starters: int unsigned auto_increment primary key, unique hash char(32), and (assuming 64K chars suffices) text url.
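The suggested layout can be sketched as follows. SQLite (via Python's sqlite3) stands in for MySQL here, so INSERT OR IGNORE plays the role of INSERT IGNORE and the hash is computed in the application with hashlib instead of MD5(); the helper names url_hash and add_url are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        hash CHAR(32) NOT NULL UNIQUE,
        url  TEXT NOT NULL
    )
""")

def url_hash(url: str) -> str:
    """Return the 32-character hex MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def add_url(url: str) -> int:
    """Insert the URL if it is new and return its integer id."""
    digest = url_hash(url)
    conn.execute(
        "INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (digest, url))
    row = conn.execute(
        "SELECT id FROM urls WHERE hash = ?", (digest,)).fetchone()
    return row[0]

first = add_url("http://www.something.com/")
again = add_url("http://www.something.com/")      # duplicate is ignored
other = add_url("http://www.something.com/page")
print(first, again, other)
```

Other tables can then reference the small integer id, while lookups by full URL go through the fixed-width unique hash column.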
Presumably you're talking about an entire URL, not just a hostname, including CGI parameters and other stuff.
SHA-1 hashing the URLs makes all the keys long, and makes sorting out trouble fairly obscure. I had to use indexes on hashes once to obscure some confidential data while maintaining the ability to join two tables, and the performance was poor.
There are two possible approaches. One is the naive and obvious one: use the URL itself as the key. It will actually work well in MySQL, and it has advantages such as simplicity and the ability to use URL LIKE 'whatever%' to search efficiently.
But if you have lots of URLs concentrated in a few domains ... for example ....
http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls
http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb
etc, you're looking at indexes which vary only in the last characters. In this case you might consider storing and indexing the URLs with their character order reversed. This may lead to a more efficiently accessed index.
(The Oracle table server product happens to have a built-in way of doing this with a so-called reverse key index.)
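To see what reversing buys, the sketch below measures how much leading text the two example URLs share before and after reversal. It is only an illustration of the key distribution, not of how any particular database builds its index.

```python
import os.path

urls = [
    "http://stackoverflow.com/questions/3735390/best-primary-key-for-storing-urls",
    "http://stackoverflow.com/questions/3735391/how-to-add-a-c-compiler-flag-to-extconf-rb",
]
reversed_urls = [u[::-1] for u in urls]

# commonprefix compares the strings character by character.
shared_forward = os.path.commonprefix(urls)
shared_reversed = os.path.commonprefix(reversed_urls)
print(repr(shared_forward))
print(repr(shared_reversed))
```

Stored forwards, the keys agree for dozens of characters before they diverge; stored reversed, they differ from the very first character. The cost is that prefix searches such as LIKE 'whatever%' no longer work on the reversed column.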
If I were you I would avoid an autoincrement key unless you have to join more than two tables ON TABLE_A.URL = TABLE_B.URL or some other join condition with that kind of meaning.
It depends on how you use the table. If you mostly select with WHERE url='<url>', then it's fine to have a one-column table. If you can use an autoincrement id to identify a URL in all places in your app, then use the autoincrement.

MySQL Table primary keys

Greetings,
I have some MySQL tables that currently use an md5 hash as a primary key. I normally generate the hash from the value of a column. For instance, let's imagine I have a table called "Artists" with the fields id, name, num_members, year. I tend to compute md5($name) and use it as the ID.
I would like to know what the downsides of doing this are. Is it just better to use integers with AUTO_INCREMENT? I tend to run away from this because it's just not worth the trouble of finding out what the last inserted id was, what the next one will be, etc.
Can you shed some light on this?
Thank you.
If you need a surrogate primary key, using an AUTO_INCREMENT field is better than an md5 hash, because it is fewer bytes of data, and database backends optimize for integer primary keys.
mysql_insert_id can be used if you need the last inserted id.
If you are generating the primary key as a hash of other columns, why not just use those other columns as a unique key, then join on those?
Another question is, what are the upsides of using an md5 hash? I can't think of any.
The MD5 isn't a true key in this case because it functionally depends on the name. That means that if you have two artists with the same name, you have duplicate "keys" for different records. You could make it a real key by hashing all the attributes together (and hoping that the probability gods don't send you a collision), or you could just save yourself the trouble and use an autoincrementing ID.
It seems like the way you're trying to use the MD5 isn't really buying you any benefit. If "$name" is unique, then why not just use "name" as the primary key? Calculating an MD5 hash and using it as a key for something that's already unique is redundant.
On the other hand, if "name" is not unique, then the MD5 hash won't be unique either and so it's pointless that way too.
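The point is easy to demonstrate. In the sketch below the two artist rows are made up, and artist_key is a hypothetical helper mirroring md5($name).

```python
import hashlib

def artist_key(name: str) -> str:
    """Derive the key the way the question does: MD5 of the name."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()

# Two different (invented) artists that happen to share a name.
artist_a = {"name": "The Same Name", "num_members": 3, "year": 1987}
artist_b = {"name": "The Same Name", "num_members": 2, "year": 1965}

same_key = artist_key(artist_a["name"]) == artist_key(artist_b["name"])
print(same_key)  # True: the second insert would violate the primary key
```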
Generally you use a hash when you don't want to store the actual value of the column. For instance, if you're storing passwords, you generally store only a hash of the password, not the password itself, so that nobody can read people's passwords just by looking at the table contents. (MD5 itself is no longer considered adequate for passwords; a dedicated password-hashing function such as bcrypt is the usual choice.)
If you don't have any unique fields, then you're stuck doing something like an auto-increment because it's at least guaranteed unique. If you use the built-in SQL auto-increment, then you'll just have to fetch the last one way or another. Alternately, if you can get away with keeping a unique counter locally in your application, that avoids having to use auto-increment, but isn't necessarily viable for most applications.
The first approach has one obvious disadvantage: if there are two artists of the same name there will be a primary key collision. Using an INT column with an auto-increment will ensure uniqueness.
Furthermore, though very unlikely, there is a chance that MD5 hashes of different strings could collide (an MD5 hash is 128 bits, so for a given pair of strings the probability is about 1 in 16 to the power of 32).
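The odds can be put into numbers. An MD5 digest has 128 bits, that is 32 hex digits with 16 possible values each, so there are 16^32 = 2^128 possible digests. The sketch below applies the usual birthday approximation, assuming hashes behave like uniformly random values (which ignores deliberately crafted collisions); the function name is hypothetical.

```python
import math

md5_values = 2 ** 128          # MD5 digests are 128 bits long
assert md5_values == 16 ** 32  # 32 hex digits, 16 choices each

def collision_probability(n: int) -> float:
    """Approximate chance that n random 128-bit hashes contain a collision."""
    return -math.expm1(-n * (n - 1) / (2 * md5_values))

# Even with a hundred million rows the chance is vanishingly small.
print(collision_probability(100_000_000))
```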
The benefits are if you present the IDs to customers (say in a query string for a web form, though that is another no-no)... it prevents users guessing another one.
Personally I use auto-increment without problems (have moved DBs to new servers and everything without problems)

Strings as Primary Keys in MYSQL Database [closed]

I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?
For example, I have a table that would have about 100 million rows, with columns such as mobile number, name and email. Mobile number and email would be unique, so can I use the mobile number or email as the primary key?
Will it affect my query performance when I search by email or mobile number? Similarly, the primary key will be used as a foreign key in 5 to 6 tables or even more.
I am using a MySQL database.
Technically yes, but if a string makes sense as the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the performance slowdown you'll get by using a string on smaller tables will be minuscule compared to the headaches you can have with an integer that doesn't mean anything in relation to the data.
Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DOES NOT cause the index to be rewritten. It does not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise a tables data that has the clustered index. At most, you will end up writing 2 pages of database.
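A toy model of the behaviour described above, with pages represented as sorted Python lists. PAGE_CAPACITY and insert_key are invented for the illustration, and real storage engines are far more involved; the model only shows that a mid-sequence insert touches one page, or two when it causes a split.

```python
import bisect
from typing import List

PAGE_CAPACITY = 4  # hypothetical; real pages hold many more rows

def insert_key(pages: List[List[int]], key: int) -> int:
    """Insert key into the sorted pages; return how many pages were written."""
    # Pick the first page whose last key is >= key, else the final page.
    index = 0
    while index < len(pages) - 1 and pages[index][-1] < key:
        index += 1
    page = pages[index]
    bisect.insort(page, key)
    if len(page) <= PAGE_CAPACITY:
        return 1  # the row fit, so only this page is rewritten
    middle = len(page) // 2  # page split: divide the rows between two pages
    pages[index:index + 1] = [page[:middle], page[middle:]]
    return 2  # two pages written; every other page is untouched

pages = [[10, 20, 30], [40, 50, 60, 70], [80, 90]]
written_fit = insert_key(pages, 25)    # room on the first page
written_split = insert_key(pages, 55)  # second page is full and splits
print(written_fit, written_split, pages)
```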
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
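Here is a small sketch of that combination, using SQLite through Python's sqlite3 module in place of MySQL. The company and invoice tables and their columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE company (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE invoice (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        company_id INTEGER NOT NULL REFERENCES company(id),
        amount     INTEGER NOT NULL
    );
""")

company_id = conn.execute(
    "INSERT INTO company (name) VALUES (?)", ("Acme Ltd",)).lastrowid
conn.executemany(
    "INSERT INTO invoice (company_id, amount) VALUES (?, ?)",
    [(company_id, 100), (company_id, 250)])

# The unique index on the natural key rejects a duplicate record.
try:
    conn.execute("INSERT INTO company (name) VALUES (?)", ("Acme Ltd",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A rename touches one parent row; child rows keep pointing at the same id.
renamed = conn.execute(
    "UPDATE company SET name = ? WHERE id = ?",
    ("Acme Inc", company_id)).rowcount
print(duplicate_rejected, renamed)  # True 1
```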
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.
Two reasons to use integers for PK columns:
We can set an identity on an integer field so that it is incremented automatically.
When we create PKs, the db creates an index (clustered or non-clustered) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every row written will cause the index to be rewritten. If you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
From performance standpoint - Yes string(PK) will slow down the performance when compared to performance achieved using an integer(PK), where PK ---> Primary Key.
From requirement standpoint - Although this is not a part of your question, I would still like to mention it. When we are handling huge data across different tables, we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables, and most of them are related to others through some relation (the concept of a Foreign Key). Therefore we cannot always choose an integer as a Primary Key; rather, we go for a combination of 3, 4 or 5 attributes as the primary key for that table. Those keys can then be used as a foreign key when we relate the records to some other table. This makes it useful to relate the records across different tables when required.
Therefore for Optimal Usage - We always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
sample_pk INT NOT NULL AUTO_INCREMENT,
sample_id VARCHAR(100) NOT NULL,
...
PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There can be a very big misunderstanding about strings in databases. Almost everyone thinks that the database representation of numbers is more compact than that of strings, and that in databases numbers are represented as they are in memory. But that is not true: in most cases the number representation is closer to a string-like representation than to anything else.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default ASPNetUserIds are 128 char strings and performance is just fine.
If the key HAS to be unique in the table, it should be the Key. Here's why:
primary string key = Correct DB relationships, 1 string key(The primary), and 1 string Index(The Primary).
The other option is a typical int Key, but if the string HAS to be unique you'll still probably need to add an index because of non-stop queries to validate or check that its unique.
So using an int identity key = Incorrect DB Relationships, 1 int key(Primary), 1 int index(Primary), Probably a unique string Index, and manually having to validate the same string doesn't exist(something like a sql check maybe).
To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.