MySQL: Using DATETIME as primary key

My database will be storing a large number of data points, so I am using an unsigned BIGINT as the primary key.
Would it ever make sense to use a DATETIME object as the primary key?
Thanks,

Yes, of course it makes sense for a date/time to be a key or part of a key if you need to uniquely identify discrete points or periods of time. I can't say whether that applies to your scenario, but as a general rule there's no fundamental reason why keys can't be based on time; almost any data warehouse does it.

No, because it can't be guaranteed to be unique. Stick with BIGINT. You can put a nice index on the DATETIME column for querying and it will be good enough.
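For illustration, a minimal sketch of that arrangement, with invented table and column names:

CREATE TABLE data_points (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    recorded_at DATETIME NOT NULL,
    value DOUBLE,
    PRIMARY KEY (id),
    KEY idx_recorded_at (recorded_at)
) ENGINE=InnoDB;

-- Range queries can then use the secondary index:
SELECT value FROM data_points
WHERE recorded_at BETWEEN '2012-01-01' AND '2012-01-31';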

It wouldn't make sense, as you would be limited to one record per second without any actual reason for that.

It makes sense if your data comes from a single time-ordered set. Say, a record of financial transactions. If you have multiple data points which naturally occurred at different instants, but have the same timestamp due to rounding, change the low-order bits to discriminate them.
This is more problematic in MySQL than in other databases, because timestamps are stored with only 1-second precision. (Edit: as of 5.6.4, MySQL has microsecond precision on time types.)
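As a hedged illustration of that 5.6.4 change (hypothetical table name), fractional-second precision is declared explicitly on the type:

CREATE TABLE events (
    occurred_at DATETIME(6) NOT NULL,  -- microsecond precision, MySQL 5.6.4+
    payload VARCHAR(255)
);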

If you happen to have multiple observations per second this will fail. For this reason it's probably better not to unless you can guarantee that there will never be more than one point per second.

Related

MySQL: What is the impact of varchar length on performance when used as a primary key?

What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance wise to justify the work.
"on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty."
The DB needs to maintain a B-Tree (or a similar structure) with the keys in a way that keeps them ordered.
If the key were hashed and the hash stored in the B-Tree, that would be fine for rapidly checking the uniqueness of the key -- the key could still be looked up efficiently. But you would not be able to search efficiently for ranges of data (e.g. with LIKE), because the B-Tree would no longer be ordered according to the string value.
So I think most databases really store the string in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be re-balanced if keys are inserted in arbitrary order (no notion of increasing value as with a numeric PK).
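To illustrate the range-search point (table and column names are invented): a trailing-wildcard LIKE can walk the ordered B-Tree, which a hash of the key could never support:

-- Can use the ordered index on code (prefix scan):
SELECT * FROM products WHERE code LIKE 'ABC%';

-- Cannot use the index order (leading wildcard forces a scan):
SELECT * FROM products WHERE code LIKE '%ABC';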
The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join table, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't run into performance issues with this. Our product is a web site under extreme load, and it is critical that it stays stable.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3,000,000 records with lots of inserts and selects on them. I think that in general the benefit of migrating to an int key would be very low, but the cost of the migration very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered. By using an auto-increment integer you guarantee that each time you insert, you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings, however, the PK you insert may need to be placed in the middle of the other keys to maintain the PK order. That process of reordering the PKs on insert can get expensive.
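A rough sketch of the two cases (InnoDB clusters rows by primary key; the table names are invented):

-- Auto-increment: every insert appends at the end of the clustered index.
CREATE TABLE seq_keys (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

-- Random string keys (e.g. GUIDs) land in the middle of existing pages
-- and can force page splits on insert.
CREATE TABLE str_keys (
    id VARCHAR(36) NOT NULL PRIMARY KEY
) ENGINE=InnoDB;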
It depends on several factors: the RDBMS, the number of indexes involving those columns, etc. But in general it will be more efficient to use ints, followed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking of something unique like a social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK reference updated when the referenced object changes.

What are the merits of using numeric row IDs in MySQL?

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, worrying about such efficiencies is premature optimization and probably not worth the effort.
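For example (invented names), the third point above means a rename touches exactly one row:

-- Children reference parent.id, not parent.name,
-- so nothing else needs updating:
UPDATE parent SET name = 'new name' WHERE id = 42;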
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer big time if you apply functions to columns in a search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent varchar(20) pk fk to parent table
birthorder int pk
name varchar(20)
Then, you could have rows like...
parent birthorder name
homer 1 bart
homer 2 lisa
homer 3 maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision, because (homer, 1) is already occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
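A minimal sketch of that compound key in MySQL (DDL details are illustrative; the foreign key back to the parent table is omitted for brevity):

CREATE TABLE child (
    parent VARCHAR(20) NOT NULL,
    birthorder INT NOT NULL,
    name VARCHAR(20),
    PRIMARY KEY (parent, birthorder)
);

INSERT INTO child VALUES ('homer', 1, 'bart');
INSERT INTO child VALUES ('homer', 2, 'lisa');
INSERT INTO child VALUES ('homer', 3, 'maggie');
INSERT INTO child VALUES ('homer', 1, 'badbart');  -- ERROR 1062: Duplicate entry 'homer-1' for key 'PRIMARY'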
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
The main benefit you get from numeric values is that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data is in a nice, searchable format -- like a binary search tree -- it can be found in O(log n), much faster.
Indexing is the process MySQL uses to prepare data to be searched; it generates search trees and other clever structures that make finding data quick. It makes many searches much faster. However, to do it, MySQL has to compare the value you are searching for to various 'key' values to determine if your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to look up data quickly, your best bet is to have an integer 'key'.
Numeric row ids have many advantages over string-based ids.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, and a numeric key is more efficient to index.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row id could be a foreign key. Numeric ids are compact to store, making them efficient.
5. I think using an auto-increment primary key has its advantages too.

SQL - performance in varchar vs. int

I have a table which has a primary key with a varchar data type, and another table with a foreign key of the same varchar datatype.
I am making a join statement using this pair of varchar columns. Though I am dealing with a small number of rows (say a hundred), it is taking 60 ms. But when the system is finally deployed, it will have around thousands of rows.
I read Performance of string comparison vs int join in SQL, and concluded that the performance of an SQL query depends upon the DB and the number of rows it is dealing with.
But when I am dealing with a very large amount of data, would it matter much?
Should I create a new column with a number datatype in both tables and join on that to reduce the time taken by the SQL query?
You should use the correct data type for that data that you are representing -- any dubious theoretical performance gains are secondary to the overhead of having to deal with data conversions.
It's really impossible to say what that is based on the question, but most cases are rather obvious. Where they are not obvious are in situations where you have a data element that is represented by a set of digits but which you do not treat as a number -- for example, a phone number.
Clues that you are dealing with this situation are:
leading zeroes that must be preserved
no arithmetic operations are carried out on the element.
string operations are carried out: e.g., "take the last four characters"
If that's the case then you probably want to store your "number" as a varchar.
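A small sketch of that case (hypothetical table and column): a phone number stored as a string keeps its leading zero and supports string operations:

CREATE TABLE contacts (
    phone VARCHAR(15) NOT NULL  -- '0712345678' keeps its leading zero
);

-- "take the last four characters":
SELECT RIGHT(phone, 4) FROM contacts;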
Yes, you should give that a shot. But before you do, make a test version of your db that you populate with the level of data you expect to have in production, and run some tests not just on SELECT, but also on INSERT, UPDATE, and DELETE. Then make a version with integer keys, and perform equivalent tests.
The numeric keys WILL be faster, for the simple reason that the keys are of smaller size, but the difference may not be noticeable. Don't blindly trust books when you can test and measure the difference yourself.
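One quick way to feel the raw comparison cost in MySQL is BENCHMARK(), which simply evaluates an expression repeatedly; it is only a micro-benchmark, not a substitute for testing real queries as described above:

SELECT BENCHMARK(10000000, 123456789 = 123456780);        -- integer comparison
SELECT BENCHMARK(10000000, 'abcdefghij' = 'abcdefghik');  -- string comparison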
(One thing to remember: if there are occasions when all you need from a relation is the value you currently have as its key, your database may run significantly faster if you can skip entire table lookups by just referencing the foreign-key on the records you have.)
"Should I create a new column with a number datatype in both tables and join on that to reduce the time taken by the SQL query?"
If you're in a position where you can change the design of the database with ease, then yes, your primary key should be an integer. Unless there is a really good reason to have an FK as a varchar, they should be integers as well.
If you can't change the PK or FK fields, then make sure they're indexed properly. This will eventually become a bottleneck, though.
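If the varchar keys must stay, a sketch of indexing the join columns (table and column names are invented):

-- Ensure the varchar FK used in joins has its own index:
ALTER TABLE child_table ADD INDEX idx_parent_code (parent_code);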
It just does not sound right to me. It will use more space and result in more reads, etc. Also, is the varchar the clustered index key? If so, the table is going to get very fragmented.

Need help understanding MySQL PACK_KEYS

I am using a BIGINT to hold an id number that will increment from 1. In one table this will be the Primary Key and will, of course, be unique; in other tables it will be a foreign key. I'm trying to figure out whether this key will be "packed" if I set PACK_KEYS, since there will be a lot of leading zeroes.
I'm having difficulty understanding the MySQL doc for the PACK_KEYS table option in table creation. Here is the relevant quote from the doc:
When packing binary number keys, MySQL uses prefix compression: Every key needs one extra byte to indicate how many bytes of the previous key are the same for the next key. The pointer to the row is stored in high-byte-first order directly after the key, to improve compression.
This means that if you have many equal keys on two consecutive rows, all following “same” keys usually only take two bytes (including the pointer to the row). Compare this to the ordinary case where the following keys takes storage_size_for_key + pointer_size (where the pointer size is usually 4). Conversely, you get a significant benefit from prefix compression only if you have many numbers that are the same. If all keys are totally different, you use one byte more per key, if the key is not a key that can have NULL values. (In this case, the packed key length is stored in the same byte that is used to mark if a key is NULL.)
They've lost me with "many equal keys on two consecutive rows, all following “same” keys usually only take two bytes (including the pointer to the row)". Can someone interpret the above doc for me, in light of what I'm trying to accomplish? E.g., for a primary key there won't be ANY "equal keys" - on two consecutive rows, on three consecutive rows, on 100 non-consecutive rows... or whatever they're driving at.
Thanks!
Chances are you do not need PACK_KEYS. I see you are using BIGINT for your PK. How many rows are you looking at having in this table eventually? What kind of data are you storing? How do you intend to retrieve/report on it, and how often? These are things I would consider first before using this feature.
If I read that documentation correctly, it's basically stating that if you have two consecutive records with long PKs say:
PK-x: 1002350025789001
PK-y: 1002350025789002
With PACK_KEYS, PK-y now becomes something like "[pointer to PK-x]2"
It's basically a way of saying PK-y is the same as PK-x except for the last number, which is 2... without having to rewrite/store the same prefix/preceding numbers.
The gains from this are most likely only realized when you are dealing with very long PKs, and will mostly be gains in storage/memory; however, I would imagine there's a cost to overall performance which may or may not be noticeable depending on how much access load that table gets.
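For reference, the option itself is set at table creation, and it applies to MyISAM tables; a minimal sketch with an invented table name:

CREATE TABLE big_ids (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=MyISAM PACK_KEYS=1;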
May not be worth it... I've never used this feature, and I've built some pretty heavy apps on MySQL.
Hope this helps. Good luck!

Theoretical situation about MySQL

I searched Google for a question I've been asking myself since this morning, but couldn't find any information or articles about it.
I was wondering about the following situation, in order to improve performance (even if only a little):
Context: I have two columns: ID and AddedAt (AddedAt is the Unix timestamp of when the row is created).
Theoretically, if you insert a new row, ID will be +1 and AddedAt will be the current time.
Now, let's say it is impossible in the current situation to have two simultaneous inserts. Would it be better to use AddedAt as the PK and remove the ID column? AddedAt would then be a single, unique column serving as both the PK and the Unix timestamp. In the end, I would have one column instead of two.
The only downside I see is maybe the size of the key that will be created on AddedAt, since a Unix timestamp nowadays is 10 digits.
Would it be better in this situation? What's your opinion?
EDIT: What about using timestamp + milliseconds?
Timestamps are in seconds. While you might not have simultaneous inserts now, as the world tends to speed up you might get multiple inserts in a second. Build your system to function soundly: don't use timestamps as primary keys.
Also, with statement-based replication, timestamps sometimes aren't consistent across DBs. Row-based replication alleviates this, but it's still another reason for concern when using them.
From a good convention standpoint, primary keys should have some clear meaning to others outside yourself if they're anything other than a plain old auto-incrementing id field. Generally, people expect numbers or char values for keys, not things like blobs, timestamps, datetimes, etc. This is especially true if the key is later used as a foreign key in another table; using a timestamp as a foreign key can be confusing to later developers. Sure, if you have a varchar GUID field you know is unique, use it as the key. Just remember that when it's used as a foreign key, you're also going to eat up quite a bit of memory if you have a huge string.
Assuming you can guarantee that two events won't occur within the same 1-second interval, then sure, you could use the timestamp field as a PK.
That being said, why are you worried about key sizes? A timestamp may be 10 digits, but its internal storage requirement is only 4 bytes. By comparison, an int is also 4 bytes, so you wouldn't be losing anything - unless you're using bigints, in which case it's 8 bytes.
Also, note that timestamp fields are subject to the Y2038 problem. They're essentially Unix timestamps that auto-format into a human-readable date for you. If your app is going to be around for more than 26 years, then you should stick with an int/bigint, which has a wraparound range of "however fast you insert rows", not a fixed date/time.
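A concrete illustration of that limit: the largest 32-bit signed Unix timestamp is 2147483647, which lands in early 2038:

SELECT FROM_UNIXTIME(2147483647);
-- -> '2038-01-19 03:14:07' in UTC (the result follows the session time zone)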
The primary key is not only a technical thing, it is the business representation of something that makes each object represented by a row unique.
A timestamp is a unique field of your object because you cannot (in your case) insert two objects at the same time, but it is NOT the primary definition of a business object (if you had a business object called "timestamp" then yes, the time when it was inserted should be the primary key)
An ID stands for "my client has a physical id that represents him": in the past, we would give numbers to clients on papers, bills...
Never forget that computer science is not the objective per se but the means to achieve your goals.
I would leave the ID column as the primary key, as there are scenarios in which the Unix timestamp will give you a value you're not expecting. One is that inserting in very fast succession returns the same timestamp; another is the server admin deciding to monkey with the server's time settings.
Joins will probably also be much more obvious, as people typically expect the primary key to be some sort of unique id, not a timestamp.
Yes, of course, but the performance gain will be minimal, and only while adding new records.
Moreover, you will be forced to use the timestamp for foreign keys in all related objects.
It is worth considering only if you expect many inserts per second and a lot of records (to save the storage of the id column and its index), but as you said the timestamp must be unique, so that's at most 1 record per second :-)