I am developing a logging database, the ids of the components being logged in this case are not determined by the database itself, but by the system that sends the report. The system id is a unique varchar, and the component's id is determined by the system (in some faraway location), so uniqueness is guaranteed when the component's primary key is system_id + component_id.
What I'm wondering is if this approach is going to be efficient. I could use auto incremented integers as the id, but that would mean I would have to do select operations before inserting so that I can get this generated id instead of using the already known string id that the system provides.
The database is going to be small scale, no more than a few dozen systems, each with a few dozen components, and maybe some thousands of component updates (another table). Old updates will be periodically dumped into a file and removed from the database, so it won't ever get "big."
Any recommendations?
I would lean towards auto incremented integers as a primary key and put indexes on system_id and component_id. Your selects before that insert will be very cheap and fast.
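For instance, a minimal sketch of that layout (table and column names here are illustrative, not from the question):

CREATE TABLE component (
  id           INT AUTO_INCREMENT PRIMARY KEY,
  system_id    VARCHAR(64) NOT NULL,
  component_id VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_system_component (system_id, component_id)  -- enforces the natural key
);

-- the select before each insert is satisfied entirely by the composite index
SELECT id FROM component WHERE system_id = 'sys-01' AND component_id = 'comp-42';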
I'm sure you'll find that tables of several million rows will perform fine with varchar() keys.
It's easy enough to test. Just import your data.
I think this question has been asked in some shape or form before, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is not indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc. - e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe - but understand that it is certainly not 'unguessable' in theory. But the issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that comes to mind is to make the uuid field unique and index it. But I've been doing some reading in relation to this, and various blog posts and StackOverflow answers describe putting indices on UUIDs as being really bad for performance. I also read that it will increase the time it takes to insert a new customer record, as the MySQL database will take time to find the correct location in which to place the record as part of the index.
The above mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer-based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance, and in turn the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement - Need the ability to store a customer record which is 'hard' to guess, easy to insert and easy to query in a large data set.
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event that you get a random duplicate UUID value, you can use an index like this.
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns, using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
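As a rough sketch of that BINARY(16) approach (this assumes MySQL 8.0, and the extra column name is hypothetical):

ALTER TABLE customer ADD COLUMN uuid_bin BINARY(16);

-- convert the text UUID on the way in
UPDATE customer SET uuid_bin = UUID_TO_BIN(uuid);

-- convert the parameter, not the column, so an index on uuid_bin stays usable
SELECT BIN_TO_UUID(uuid_bin) AS uuid, givenname, surname, email
FROM customer
WHERE uuid_bin = UUID_TO_BIN(:uuid);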
I have a table which stores user files, e.g. images. It has an auto-increment primary key, so it's easy to guess via a URL what the next/previous id is, e.g. mydomain/file/12. Whilst I have security in place to prevent unauthorised users from accessing someone else's files, I'd prefer to have a more complex URL id which is difficult to guess.
The table will have a lot of inserts/deletes, so I've stuck to using an auto-increment id for the primary key, as opposed to using a UUID as the primary key, due to its associated performance issues.
So I was thinking of adding an additional column called uuid which I could use to retrieve files. Whilst the MySQL docs state that UUIDs are designed as numbers that are globally unique in space and time, would I still need to implement a unique index on this column, since there would be no database mechanism to prevent a collision - if it ever occurred?
Any advice appreciated.
If you want to insist that the column be unique, then create a unique index/constraint on it.
This will prevent manual inserts and updates from duplicating an existing value, even if the automatic mechanism generates a guaranteed-unique value on "normal" inserts.
That said, if performance is of paramount concern, then you might decide to -- essentially -- disable "manual" inserts and forego the unique index. That would be a compromise based on performance needs.
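As a sketch, that compromise might look like this (table and column names are hypothetical):

CREATE TABLE user_file (
  id   INT AUTO_INCREMENT PRIMARY KEY,  -- fast clustered key for heavy insert/delete traffic
  uuid CHAR(36) NOT NULL,               -- public identifier exposed in URLs
  path VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_user_file_uuid (uuid)   -- or a plain INDEX if you forego the uniqueness check
);

SELECT id, path FROM user_file WHERE uuid = '123e4567-e89b-12d3-a456-426614174000';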
If I set the primary key to be INT type (AUTO_INCREMENT) or set it to UUID, what is the difference between these two in database performance (SELECT, INSERT, etc.), and why?
UUID() returns a universally unique identifier (hopefully still unique if imported into another DB as well).
To quote from the MySQL doc:
A UUID is designed as a number that is globally unique in space and time. Two calls to UUID() are expected to generate two different values, even if these calls are performed on two separate computers that are not connected to each other.
On the other hand, a simple INT primary key (e.g. AUTO_INCREMENT) will return a unique integer for the specific DB and DB table, but one which is not universally unique (so if imported into another DB, chances are there will be primary key conflicts).
In terms of performance, there shouldn't be any noticeable difference between auto-increment and UUID. Most posts (including some by the authors of this site) state as much. Of course, UUID may take a little more time (and space), but this is not a performance bottleneck for most (if not all) cases. Having the column as Primary Key should make both choices equal with respect to performance. See the references below:
To UUID or not to UUID?
Myths, GUID vs Autoincrement
Performance: UUID vs auto-increment in cakephp-mysql
UUID performance in MySQL?
Primary Keys: IDs versus GUIDs (coding horror)
(UUID vs auto-increment performance results, adapted from Myths, GUID vs Autoincrement)
UUID pros / cons (adapted from Primary Keys: IDs versus GUIDs)
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes.
Note
I would read the mentioned references carefully and decide whether to use UUIDs depending on the use case. That said, in many cases UUIDs would indeed be preferable. For example, one can generate UUIDs without using/accessing the database at all, or even use UUIDs which have been pre-computed and/or stored somewhere else. Plus, you can easily generalise/update your database schema and/or clustering scheme without having to worry about IDs breaking and causing conflicts.
In terms of possible collisions, for example using v4 UUIDs (random), the probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion.
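That figure is consistent with the birthday approximation over the 122 random bits; a quick check of the arithmetic:

p \approx \frac{n^2}{2 \cdot 2^{122}}
  = \frac{(1.03 \times 10^{14})^2}{2^{123}}
  \approx \frac{1.06 \times 10^{28}}{1.06 \times 10^{37}}
  \approx 10^{-9}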
A UUID cannot act as a primary key until it is actually persisted in the DB, so a round trip still happens: you cannot assume it is the PK without a successful transaction. UUIDs are variously time-based, MAC-based, name-based, or random. Given that we are moving heavily towards container-based deployments, which assign MAC addresses from a predictable starting sequence, relying on MAC addresses will not work. Time-based UUIDs assume systems are always in exact time sync, which is not always true, as clocks do not always follow the rules. A GUID cannot guarantee that a collision will never occur, only that it will not occur within a given short period of time; given enough time, systems running in parallel, and the proliferation of systems, that guarantee will eventually fail.
http://www.ietf.org/rfc/rfc4122.txt
For MySQL, which uses a clustered primary key, a randomly generated version-4 UUID will hurt insertion performance if used as the primary key. This is because it requires reordering the rows to place the newly inserted row at the right position inside the clustered index.
FWIW, PostgreSQL uses a heap instead of a clustered primary key, so using a UUID as the primary key won't impact PostgreSQL's insertion performance.
For more information, this article has a more comprehensive comparison between UUID and Int: Choose Primary Key - UUID or Auto Increment Integer
We have built an application with MySQL as the database. Every week we export the data dump from the database, and delete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is auto-increment, so it starts from 1 in every data dump, which causes duplicate IDs when the dumps are merged. I am sure there must be better ways to do it, since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?
If you can easily identify your foreign key fields (like they take the form *_id) then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example let's say you have two dump files and you know their primary key range does not exceed 1,000,000, you increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention, then you must keep separate per-table information about where to find them in the statements.
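Alternatively, if editing the dump files directly proves too fiddly, the same offset can be applied in SQL after loading each dump into staging tables (all table and column names here are hypothetical):

-- dump #2 has been loaded into staging tables; shift its id space
UPDATE staging_customer SET id = id + 1000000;
UPDATE staging_order SET id = id + 1000000, customer_id = customer_id + 1000000;

-- then copy into the merged tables without clashes
INSERT INTO merged_customer SELECT * FROM staging_customer;
INSERT INTO merged_order SELECT * FROM staging_order;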
Good luck.
The best way would be to have another database that acts as a data warehouse, into which you copy the contents of your app's database. After that, you don't truncate the tables; you simply use DELETE FROM tablename - that way, your auto_increment counters won't get reset.
It's an ugly solution to export something, then truncate the database, then expect an import to proceed properly. Even if you get around the problem of clashing auto-increments (there's the ON DUPLICATE KEY clause, which lets you do something when a unique key constraint fails), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic, and the solution given here is quick and not pretty; other people will probably suggest other methods. But if you are doing this to offload the db your app uses, it's a bad design. Try googling MySQL's partitioning support if you're aiming for better performance with a larger data set.
For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way your IDs will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
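A sketch of that approach, with hypothetical table and column names (the ARCHIVE engine only supports INSERT and SELECT, which is fine for analysis-only data):

CREATE TABLE updates_archive (
  id         INT NOT NULL,
  detail     TEXT,
  created_at DATETIME
) ENGINE = ARCHIVE;

INSERT INTO updates_archive
SELECT id, detail, created_at FROM updates
WHERE created_at < NOW() - INTERVAL 1 DAY;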
I have a MySQL database which has 3 tables that have to be joined together. I receive smaller databases that must feed this MySQL database, appending the new data as I get it. The problem I have is that the smaller dbs I get are generated by an outside application and are not really meant to be used all together. Therefore, when I utilize the schema of the smaller database, I have no way to know how all the records from the 3 tables belong together.
I was thinking about inserting a guid to serve as a primary key that I can add to the tables and insert when I insert all of the new data.
However, I am leery of using a char field (used to store the guid) as a key. Is this a valid concern, or is using a char field, knowing that it will always be a guid, a sufficient solution? Can someone recommend a better approach?
Thanks
MySQL does not provide a GUID/UUID type, so you would need to generate the key in the code you're using to insert rows into the DB. A char(32) or char(36) (if including hyphens) is the best data type to store a GUID in lieu of an actual GUID data type.
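For illustration, with a hypothetical table - MySQL's UUID() function can also generate the value server-side if you'd rather not do it in application code:

CREATE TABLE item (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  guid CHAR(36) NOT NULL,
  UNIQUE KEY uq_item_guid (guid)
);

INSERT INTO item (guid) VALUES (UUID());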
I'm sorry, I'm not 100% familiar with MySQL; in SQL Server Express there is a uniqueidentifier type you can set a column to, which is really a GUID. You can even set it to auto-number itself so it picks random ones.
My boss at work HATES GUIDs though, and we work with offline/online systems a lot, so he came up with another system in which each feeding database is assigned an ID (called DEPT), and whenever he inserts into the master table from one of the smaller ones, he writes its DEPT into a separate integer column so it's easily sortable.
To implement this, you'd add a second key column (making each table the import is performed on a dual-key table).
Example:
PrimaryKey1  DEPT  Name
1            0     Slink
2            0     Fink
3            0     Werd
1            1     Slammer
2            1     Blam
1            2     Werrr
2            2     Soda
MySQL 8.0 has the UUID() function, which should do the job.
Unfortunately, there was a bug which meant you could not use it in a DEFAULT expression. But apparently, it is fixed in 8.0.19, as stated in this answer.
Some DBs, like MS SQL Server, provide a GUID data type; I am not sure about MySQL.
In general there is no problem with char or varchar as a primary key, unless the values are too long. Normally integers are preferred because they are a bit faster, but it depends on whether this matters much for you.
Effectively, you could also use a composite primary key. One component could be your original primary key, which is unique in one db only. The second component could be the number of the database, if you are able to assign a unique number to each database.
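A sketch of that composite key (names are hypothetical):

CREATE TABLE merged_record (
  db_number   INT NOT NULL,  -- which source database the row came from
  original_id INT NOT NULL,  -- the row's id within that source database
  payload     VARCHAR(255),
  PRIMARY KEY (db_number, original_id)
);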
You could also use the following scheme:
int newId = dbNumber * 10000 + iDInSmallDb;
So the last 4 digits are the original id from the small db, and the remaining digits are the db number.
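Decoding works the other way; for example, with hypothetical values db number 3 and original id 217 (so newId = 30217):

SELECT 30217 DIV 10000 AS db_number,   -- 3
       30217 MOD 10000 AS original_id; -- 217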