For a long time, and for several reasons, I have understood that DATETIME columns should not form part of the primary key of a table. Among these reasons, I think it is a bad idea given the high precision of this field. For example, 2014-06-26 15:35:12 won't match 2014-06-26 15:35:13.
Questions like "Use timestamp (or datetime) as part of primary key (or part of clustered index)" seem to support this "phobia".
However, I am now facing a very concrete problem: I want to map some values of a function into a MySQL table, where the function looks like
f:(TimeInDay,TimeInDay) -> Integer
Where the arguments represent a time interval (with second precision) within the same day.
Each unique (TimeInDay, TimeInDay) pair results in a concrete output value. So I came to this table structure:
CREATE TABLE sessions_schedule
(
    tIni TIME NOT NULL,
    tEnd TIME NOT NULL,
    X TINYINT,
    CONSTRAINT pk PRIMARY KEY (tIni, tEnd)
);
Here the two TIME columns make up the primary key.
In the MySQL online manual I found:
MySQL recognizes TIME values in several formats,... Some of these
formats can include a trailing fractional seconds part in up to
microseconds (6 digits) precision. Although this fractional part is
recognized, it is discarded from values stored into TIME columns.
So it seems to me that, in this case, including TIME fields in the primary key is justified. Am I right?
For a long time, and for several reasons, I have understood
that DATETIME columns should not form part of the primary key of a
table.
That's not true for the relational model, it's not true of SQL in general, and it's not true of MySQL in particular.
Among these reasons, I think it is a bad idea given the high
precision of this field. For example, 2014-06-26 15:35:12 won't match
2014-06-26 15:35:13.
Your example isn't a good one. Think about using integers instead. Would you expect the integer 3 to match the integer 4? Of course not. So why would you think '2014-06-26 15:35:12' would match '2014-06-26 15:35:13'? They're different values. Different values aren't supposed to match.
So, it seems to me, that in this case the inclusion of TIME fields in
the primary key is justified. Am I right?
Quite likely. You just have to make sure that you don't store any values more precise than a second, and that tIni is before tEnd.
(MySQL can store trailing microseconds.)
On other platforms, you'd probably use CHECK constraints to enforce those requirements, but MySQL doesn't enforce CHECK constraints (this changed in MySQL 8.0.16, which does enforce them). On older versions you'll need to write triggers, or revoke permissions on the tables and require changes to go through a stored procedure.
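As a rough illustration of the CHECK-constraint approach, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for an engine that enforces CHECK constraints. It is not MySQL syntax: storing the times as 'HH:MM:SS' text, the length check that rejects fractional seconds, and the try_insert helper are all assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions_schedule (
        tIni TEXT NOT NULL,
        tEnd TEXT NOT NULL,
        X    INTEGER,
        PRIMARY KEY (tIni, tEnd),
        -- Assumed format 'HH:MM:SS': anything longer carries a fractional part.
        CHECK (length(tIni) = 8 AND length(tEnd) = 8),
        -- Zero-padded times compare correctly as text.
        CHECK (tIni < tEnd)
    )
""")

def try_insert(t_ini, t_end, x):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sessions_schedule VALUES (?, ?, ?)",
                (t_ini, t_end, x),
            )
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("09:00:00", "10:30:00", 1),    # valid interval
    try_insert("09:00:00", "10:30:00", 2),    # same (tIni, tEnd) pair again
    try_insert("11:00:00", "10:00:00", 3),    # tIni is after tEnd
    try_insert("12:00:00.5", "13:00:00", 4),  # more precise than a second
]
print(results)
```

Only the first insert should be accepted; the others are rejected by the primary key and the two checks respectively.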
I am thinking about the best way to index my data. Is it a good idea to use the timestamp as my primary key? I am saving it anyway, and I thought about saving some columns. The timestamp should be an integer, not a datetime column, because of performance. Moreover, I don't want to be restricted in the amount of data stored in a short time (between two seconds). Therefore, I thought about an additional AUTO_INCREMENT column. Now I have a unique key (timestamp and AI), and I can get the last inserted id easily by using LAST_INSERT_ID. Is it possible to reset the AI counter every second, or whenever there is a new timestamp? Or is it possible to detect whether there is already a row with the same timestamp and increase the AI value (I still want to be able to use LAST_INSERT_ID)?
Please share some thoughts.
The timestamp should be an integer not a datetime column, because of performance.
I think you are of the belief that datetime is stored as a string. It is stored as numbers quite efficiently and with a wider range and more accuracy than an integer.
Using an integer may decrease performance because the database may not be able to correctly index it for use as a timestamp. It will complicate queries because you will not be able to use the full suite of date and time functions without first converting the integer to a datetime.
Use the appropriate date/time type, index it, and let the database optimize it.
Moreover, I don't want to be restricted in the amount of data stored in a short time (between two seconds). Therefore, I thought about an [additional] AUTO_INCREMENT column.
This would seem to defeat the point of "saving some columns". Now your primary key is two integers. Worse, it's a compound key which requires all references to store both values increasing storage requirements and complicating joins.
All the extra work necessary to determine the next primary key could be done in an insert trigger, but now you've added complexity and extra work to every insert.
Is it a good idea to use the timestamp as my primary key?
A primary key should be A) unique and B) immutable. A timestamp is not unique, and you might need to change it.
Your primary key is unlikely to be a performance or storage bottleneck. Unless you have a good reason, stick with a simple, auto-incrementing big integer. A big integer because 2 billion is smaller than you think.
MySQL encapsulates this in SERIAL, which is an alias for BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE.
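To put "2 billion is smaller than you think" into numbers, here is a back-of-the-envelope Python sketch. The rate of 1,000 inserts per second is a purely hypothetical figure chosen for illustration, as is the days_until_exhausted helper.

```python
INT_SIGNED_MAX = 2**31 - 1        # largest value of a signed 4-byte INT
BIGINT_UNSIGNED_MAX = 2**64 - 1   # largest value of an unsigned 8-byte BIGINT
SECONDS_PER_DAY = 86_400

def days_until_exhausted(max_id, inserts_per_second):
    """Days until an auto-increment counter reaches max_id at a steady rate."""
    return max_id / (inserts_per_second * SECONDS_PER_DAY)

# Hypothetical steady load of 1,000 inserts per second.
int_days = days_until_exhausted(INT_SIGNED_MAX, 1_000)
bigint_years = days_until_exhausted(BIGINT_UNSIGNED_MAX, 1_000) / 365

print(f"signed INT lasts about {int_days:.1f} days")
print(f"unsigned BIGINT lasts about {bigint_years:.2e} years")
```

At that assumed rate a signed INT is exhausted in under a month, while an unsigned BIGINT lasts hundreds of millions of years.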
TIMESTAMP and DATETIME are risky as a PRIMARY KEY since the PK must be Unique.
Otherwise, it is fine to use them for the PK or an index. But here are some caveats:
When using composite indexes (multi-column), put the things tested with = first; put the datetime last.
Smaller is slightly better when picking a PK. TIMESTAMP takes 4 bytes and DATETIME takes 5 bytes (when not including fractional seconds); INT is 4 bytes; BIGINT is 8.
The time taken for comparing one PK value to another is insignificant. That includes character PKs. For example, country_code CHAR(2) CHARACTER SET ascii is only 2 bytes -- better than 'normalizing' it and replacing it with a 4-byte cc_id INT.
So, no, don't bother using INT instead of TIMESTAMP.
In my experience, 2/3 of tables have a "natural" PK and don't need an auto_increment PK.
One of the worst places to use an auto_inc is on a many-to-many mapping table. It is likely to slow down most operations by a factor of 2.
You hinted at PRIMARY KEY(timestamp, ai):
You need to add INDEX(ai) to keep AUTO_INCREMENT happy.
It provides locality of reference for temporally 'near' rows. But so does ai, by itself.
No, there is no practical way to reset the ai each second. (MyISAM has such, but do not use that engine.) Instead be sure to declare ai big enough to last 'forever' before overflowing.
But I can't think of a use case where there isn't a better way.
I am using a BIGINT to hold an id number that will increment from 1. In one table this will be the Primary Key and will, of course, be unique; in other tables it will be a foreign key. I'm trying to figure out whether this key will be "packed" if I set PACK_KEYS, since there will be a lot of leading zeroes.
I'm having difficulty understanding the MySQL doc for the PACK_KEYS table option in table creation. Here is the relevant quote from the doc:
When packing binary number keys, MySQL uses prefix compression:
Every key needs one extra byte to indicate how many bytes of the
previous key are the same for the next key.
The pointer to the row is stored in high-byte-first order directly
after the key, to improve compression.
This means that if you have many equal keys on two consecutive rows,
all following “same” keys usually only take two bytes (including the
pointer to the row). Compare this to the ordinary case where the
following keys takes storage_size_for_key + pointer_size (where the
pointer size is usually 4). Conversely, you get a significant benefit
from prefix compression only if you have many numbers that are the
same. If all keys are totally different, you use one byte more per
key, if the key is not a key that can have NULL values. (In this case,
the packed key length is stored in the same byte that is used to mark
if a key is NULL.)
They've lost me with "many equal keys on two consecutive rows,
all following “same” keys usually only take two bytes (including the
pointer to the row)". Can someone interpret the above doc for me, in light of what I'm trying to accomplish? E.g., for a primary key there won't be ANY "equal keys" - on two consecutive rows, on three consecutive rows, on 100 non-consecutive rows... or whatever they're driving at.
Thanks!
Chances are you do not need PACK_KEYS. I see you are using BIGINT for your PK. How many rows do you expect this table to hold eventually? What kind of data are you storing? How do you intend to retrieve and report on it, and how often? These are things I would consider first before using this feature.
If I read that documentation correctly, it's basically stating that if you have two consecutive records with long PKs say:
PK-x: 1002350025789001
PK-y: 1002350025789002
With PACK_KEYS, PK-y now becomes something like "[pointer to PK-x]2"
It's basically a way of saying PK-y is the same as PK-x except for the last digit, which is 2, without having to rewrite or store the same prefix (the preceding digits).
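The idea can be sketched with a toy model in Python. This is not MyISAM's actual on-disk format: it assumes each key is an 8-byte big-endian unsigned integer, charges one extra byte for the "how many leading bytes are shared with the previous key" counter, and ignores the row pointer. The helper names are made up for the sketch.

```python
import struct

def shared_prefix_len(prev: bytes, cur: bytes) -> int:
    """Number of leading bytes that cur has in common with prev."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n

def packed_sizes(keys):
    """Bytes per key in the toy scheme: 1 counter byte + the unshared suffix."""
    sizes = []
    prev = b""
    for k in keys:
        cur = struct.pack(">Q", k)  # 8 bytes, high byte first
        shared = shared_prefix_len(prev, cur)
        sizes.append(1 + len(cur) - shared)
        prev = cur
    return sizes

# The two keys from the example above, plus one more consecutive value.
sizes = packed_sizes([1002350025789001, 1002350025789002, 1002350025789003])
print(sizes)

# An id that simply counts up from 1 shares its leading zero bytes.
print(packed_sizes([1, 2, 3]))
```

In this model the first key costs one byte more than an unpacked key, and each following key only stores the bytes that differ from its predecessor, which is why keys with long common prefixes (including leading zero bytes) benefit.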
The gains from this are most likely only realized when you are dealing with very long PKs and will mostly be gains in storage/memory, however I would imagine there's a cost to overall performance which may or may not be noticeable depending on how much access load that table gets.
May not be worth it... I've never used this feature, and I've built some pretty heavy apps on MySQL.
hope this helps.
Good Luck
I searched Google for a question I have been asking myself since this morning, but couldn't find any information or article about it.
I was wondering whether, in the following situation, I could improve performance (even by a small percentage):
Context: I have two columns: ID and AddedAt (AddedAt is the Unix timestamp of when the row is created).
Theoretically, if you insert a new row, ID will be incremented by 1 and AddedAt will be the current time.
Now, let's say it is impossible in the current situation to have two simultaneous inserts. Would it be better to use AddedAt as the PK and remove the ID column? AddedAt would be a single unique column serving as both the PK and the Unix timestamp. So in the end I would have one column instead of two.
The only downside I see is maybe the size of the key that will be created on AddedAt, since a Unix timestamp nowadays is 10 digits.
Would it be better in this situation? What's your opinion?
EDIT: What about using timestamp + ms ?
Timestamps are in seconds. While you might not have simultaneous inserts, as the world tends to speed up you might get multiple inserts in a second. Build your system to function soundly: don't use timestamps as primary keys.
Also, with statement-based replication, timestamps sometimes aren't consistent across databases. Row-based replication alleviates this, but it's still another reason for concern when using them.
From a convention standpoint, primary keys should have some clear meaning to people other than yourself if they are anything other than a plain old auto-incrementing id field. Generally, people expect numbers or char values for keys, not things like blobs, timestamps, datetimes, etc. This is especially true if the key is later used as a foreign key in another table; using a timestamp as a foreign key can be confusing to later developers. Sure, if you have a varchar GUID field you know is unique, use it as the key. Just remember that when it is used as a foreign key you're also going to eat up quite a bit of memory if you have a huge string.
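A small Python sketch of the failure mode: once arrival times are truncated to whole seconds, rows that arrived at distinct instants end up with the same key. The arrival times below are made-up sample values.

```python
from collections import Counter

# Hypothetical arrival times: seconds since the epoch, with fractions.
arrivals = [1403796912.10, 1403796912.45, 1403796912.90, 1403796913.20]

# A seconds-resolution timestamp keeps only the integer part.
second_keys = [int(t) for t in arrivals]

# Any key that occurs more than once would violate a primary key.
collisions = {k: n for k, n in Counter(second_keys).items() if n > 1}
print(collisions)
```

Three of the four sample rows fall in the same second, so only one of them could be inserted if the truncated timestamp were the primary key.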
Assuming you can guarantee that two events won't occur within the same 1-second interval, then sure, you could use the timestamp field as a PK.
That being said, why are you worried about key sizes? A timestamp may be 10 digits, but its internal storage requirement is only 4 bytes. By comparison, an int is also 4 bytes, so you wouldn't be losing anything, unless you're using bigints, in which case it's 8 bytes.
Also, note that timestamp fields are subject to the Y2038 problem. They're essentially Unix timestamps that are auto-formatted into a human-readable date for you. If your app is going to be around for more than 26 years, then you should stick with an int/bigint, whose wraparound point depends on how fast you insert rows, not on a fixed date/time.
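The fixed limit can be computed directly. This short Python sketch assumes the timestamp is a signed 32-bit count of seconds since the Unix epoch and converts the largest such value to a UTC date.

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit seconds counter can hold.
MAX_SIGNED_32 = 2**31 - 1

# The last instant representable under that assumption, in UTC.
last_moment = datetime.fromtimestamp(MAX_SIGNED_32, tz=timezone.utc)
print(last_moment.isoformat())
```

One second after that instant the counter no longer fits in 32 signed bits, regardless of how many rows the table holds.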
The primary key is not only a technical thing, it is the business representation of something that makes each object represented by a row unique.
A timestamp is a unique field of your object because you cannot (in your case) insert two objects at the same time, but it is NOT the primary definition of a business object (if you had a business object called "timestamp" then yes, the time when it was inserted should be the primary key)
An ID stands for "my client has a physical id that represents him": in the past, we would give numbers to clients on papers, bills...
Never forget that computer science is not the objective per se but the means to achieve your goals.
I would leave the ID column as the primary key, as there may be scenarios in which the Unix timestamp will give you a value you're not expecting. One is that inserting rows in very fast succession can return the same timestamp; another is that the server admin may decide to monkey with the server's time settings.
Joins will probably also be much more obvious, as people typically expect the primary key to be some sort of unique id, not a timestamp.
Yes, of course, but the performance gain will be minimal, and only when adding new records.
Moreover, you will be forced to use the timestamp for foreign keys in all related objects.
It is worth considering only if you expect many inserts per second and a lot of records (to save storage on the id column and its index), but as you said, the timestamp will be unique, so it's at most 1 record per second :-)
A developer of mine was making an application and came up with the following schema
purchase_order int(25)
sales_number int(12)
fulfillment_number int(12)
purchase_order is the index in this table. (There are other fields, but they are not relevant to this issue.) purchase_order is a concatenation of sales_number + fulfillment_number.
Instead, I proposed an auto-incrementing id field.
Current format could be essentially 12-15 characters long and randomly generated (Though always unique as sales_number + fulfillment_number would always be unique).
My question here is:
If I have 3 rows, each with a random but unique ID (e.g. 983903004, 238839309, 288430274), versus three rows with the IDs 1, 2, 3, is there a performance hit?
As an aside, my other argument (for those interested) was that the schema makes little sense on the grounds of data redundancy: you can easily do a SELECT CONCAT(sales_number, fulfillment_number)... rather than storing the two columns together in a third.
The problem as I see it is not bigint vs int (an auto-increment column can be bigint as well; there is nothing wrong with that) but the random value for the primary key. If you use the InnoDB engine, the primary key is at the same time the clustered key, which defines the physical order of the data. Inserting random values can potentially cause more page splits and, as a result, greater fragmentation, which in turn slows down not only insert/update queries but also selects.
Your argument about concatenating makes sense, but executing CONCAT also has its cost. (Unfortunately, MySQL doesn't support calculated persistent columns, so in some cases it's OK to store the result of the concatenation in a separate column. This changed in MySQL 5.7, which added stored generated columns.)
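The page-split effect can be seen in a deliberately simplified Python simulation. It is not InnoDB's real algorithm: it assumes leaf pages that hold a fixed number of keys, that a full page is split in half when a key lands inside it, and that a key larger than everything stored so far simply starts a new page. The page capacity, key count, and function names are all hypothetical.

```python
import bisect
import random

PAGE_CAPACITY = 10  # hypothetical number of keys per leaf page

def insert_all(keys, capacity=PAGE_CAPACITY):
    """Insert unique keys one at a time; return (page_count, split_count)."""
    pages = []   # sorted leaf pages, kept in key order
    splits = 0
    for key in keys:
        if not pages:
            pages.append([key])
            continue
        # Locate the page whose key range should receive this key.
        firsts = [p[0] for p in pages]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[i]
        if len(page) < capacity:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the last full page: no split needed.
            pages.append([key])
        else:
            # The key falls inside a full page: split it in half first.
            mid = capacity // 2
            left, right = page[:mid], page[mid:]
            pages[i:i + 1] = [left, right]
            splits += 1
            bisect.insort(left if key < right[0] else right, key)
    return len(pages), splits

sequential_keys = list(range(1, 1001))
random_keys = random.Random(42).sample(range(1, 1001), 1000)

seq_pages, seq_splits = insert_all(sequential_keys)
rnd_pages, rnd_splits = insert_all(random_keys)
print("sequential:", seq_pages, "pages,", seq_splits, "splits")
print("random:    ", rnd_pages, "pages,", rnd_splits, "splits")
```

Under these assumptions the sequential keys fill every page completely without a single split, while the same keys in random order trigger splits and leave partly empty pages behind, which is the fragmentation described above.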
AFAIK integers are stored and compared as integers so the comparisons should take the same length of time.
Concatenating two ints (32bit) into one bigint (64bit) may have a performance hit that is hardware dependent.
Having incremental ids will put records that were created around the same time near each other on disk. This might make some queries faster, if the id is the primary key on InnoDB or is part of the index those queries use.
Incremental records can sometimes be inserted a little bit quicker. Test to see.
You'll need to make sure that the random id is unique, so you'll need an extra lookup.
I don't know if these points are material for your application.
My database will be storing a large number of data points, so I am using an unsigned BIGINT as the primary key.
Would it ever make sense to use a DATETIME object as the primary key?
Thanks,
Yes, of course it makes sense for a date/time to be a key or part of a key if you need to uniquely identify discrete points or periods of time. I can't say if that applies to your scenario, but as a general rule there's no fundamental reason why keys can't be based on time; almost any data warehouse does it.
No because it can't be guaranteed to be unique. Stick with BIGINT. You can put a nice index on the DateTime for querying and it will be good enough.
It wouldn't make sense, as you would be limited to one record per second without any actual reason for that.
It makes sense if your data comes from a single time-ordered set. Say, a record of financial transactions. If you have multiple data points which naturally occurred at different instants, but have the same timestamp due to rounding, change the low-order bits to discriminate them.
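One way to "change the low-order bits" is sketched below in Python. It assumes the timestamps arrive sorted in ascending order and that nudging a duplicate forward by one microsecond is acceptable for the data; make_unique and its step parameter are hypothetical names, and the column would need to be able to store microseconds.

```python
from datetime import datetime, timedelta

def make_unique(timestamps, step=timedelta(microseconds=1)):
    """Nudge repeated values forward so every timestamp is distinct.

    Assumes the input is sorted in ascending order.
    """
    result = []
    for ts in timestamps:
        if result and ts <= result[-1]:
            # Same instant as (or overtaken by) the previous row: bump it.
            ts = result[-1] + step
        result.append(ts)
    return result

t = datetime(2014, 6, 26, 15, 35, 12)
unique = make_unique([t, t, t, t + timedelta(seconds=1)])
for ts in unique:
    print(ts.isoformat())
```

The original ordering is preserved and the values become usable as a key, at the cost of the stored time no longer being exactly the recorded one.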
This is more problematic in MySQL than in other databases, because timestamps are stored with only 1-second precision. (Edit: as of 5.6.4, MySQL has microseconds precision on time types)
If you happen to have multiple observations per second this will fail. For this reason it's probably better not to unless you can guarantee that there will never be more than one point per second.