I have a MySQL database which has 3 tables that have to be joined together. I receive smaller databases that must feed this MySQL database, appending the new data as I get it. The problem is that the smaller DBs I get are generated by an outside application and are not really meant to be used all together. Therefore, when I use the schema of the smaller database, I have no way to know how all the records from the 3 tables belong together.
I was thinking about adding a GUID column to the tables to serve as a primary key, and populating it when I insert all of the new data.
However, I am leery of using a char field (used to store the GUID) as a key. Is this a valid concern, or is using a char field, knowing that it will always be a GUID, a sufficient solution? Can someone recommend a better approach?
Thanks
MySQL does not provide a GUID/UUID type, so you would need to generate the key in the code you're using to insert rows into the DB. A char(32) or char(36) (if including hyphens) is the best data type to store a GUID in lieu of an actual GUID data type.
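A minimal sketch of that approach, assuming a CHAR(36) key column (the table name, column name and sample value are placeholders, not from your schema):

CREATE TABLE master_records (
  id CHAR(36) NOT NULL,   -- GUID string generated by the importing application
  PRIMARY KEY (id)
);

INSERT INTO master_records (id) VALUES ('6ccd780c-baba-1026-9564-5b8c656024db');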
I'm sorry, I'm not 100% familiar with MySQL, but in SQL Server Express there is a uniqueidentifier type you can set a column to, which is really a GUID. You can even set it to auto-number itself so it picks random ones.
My boss at work HATES GUIDs though, and we work with offline/online systems a lot, so he came up with another system in which each feeding database is assigned an ID (called DEPT), and whenever he inserts into the master table from one of the smaller ones, he writes its DEPT into a separate integer column so it's easily sortable.
To implement this, you'd add a second key column (making each table the import is performed on a dual-key table).
Example:
PrimaryKey1  DEPT  Name
1            0     Slink
2            0     Fink
3            0     Werd
1            1     Slammer
2            1     Blam
1            2     Werrr
2            2     Soda
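The example above, as rough MySQL DDL (the table name and column sizes are invented for illustration):

CREATE TABLE master_names (
  PrimaryKey1 INT NOT NULL,         -- the id the smaller database already assigned
  DEPT INT NOT NULL,                -- which feeding database the row came from
  Name VARCHAR(100),
  PRIMARY KEY (PrimaryKey1, DEPT)   -- the pair is unique even though neither column is on its own
);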
MySQL 8.0 has the UUID() function, which should do the job.
Unfortunately, there was a bug which meant you could not use it in a DEFAULT expression. But apparently it is fixed in 8.0.19, as stated in this answer.
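If you are on 8.0.19 or later, a default expression along these lines should work (a minimal sketch; table and column names are placeholders):

CREATE TABLE records (
  id CHAR(36) NOT NULL DEFAULT (UUID()),   -- expression default, so the parentheses are required
  payload VARCHAR(255),
  PRIMARY KEY (id)
);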
Some DBs like MS SQL Server provide a GUID data type; I am not sure about MySQL.
In general there is no problem with char or varchar as a primary key unless it is too long. Normally integers are preferred because they are a bit faster, but it depends on whether this matters much for you.
Alternatively, you could use a composite primary key. One component could be your original primary key, which is unique in one DB only. The second component could be the number of the database, if you are able to assign a unique number to each database.
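A minimal sketch of that composite key, assuming you assign each incoming database a number (db_number and original_id are invented names):

CREATE TABLE imported_rows (
  db_number INT NOT NULL,       -- number you assign to each incoming smaller database
  original_id INT NOT NULL,     -- the row's primary key inside that smaller database
  PRIMARY KEY (db_number, original_id)
);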
You could also use the following scheme:
int newId = dbNumber * 10000 + iDInSmallDb;
So the last 4 digits are the original id from the small DB, and the remaining digits are the DB number.
I have multiple tables which store 100 million+ rows of data each. There are only a few possible unique values for any given column, so many of the columns have duplicate values.
When I initially designed the schema I decided to use secondary linked tables to store the actual values, in order to optimise the storage space required for the database.
For example:
Instead of a table for storing user agents like this:
id (int)
user_agent (varchar)
I am using 2 tables like this:
Table 1
id (int)
user_agent_id (int)
Table 2
id (int)
user_agent (varchar)
When there are 100 million+ rows I found this schema saves a massive amount of storage space because there are only a few hundred possible user agents and those strings make up the majority of the data.
The issue I am running in to is:
Using linked tables to store so much of the string data across many different tables is adding overhead on the development side and making querying the data much slower, since joins are required.
My question is:
Is there a way I can put all of the columns in a single table, and force mysql to not duplicate the storage required for columns with duplicate values? I'm beginning to think there must be some built in way to handle this type of situation but I have not found anything in my research.
If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
Thank you!
After some digging and testing I found what seems to be the best solution: creating an index and foreign key constraint using the varchar column itself, rather than using an ID field.
InnoDB supports foreign keys with varchar as well as int: https://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html
Here is an example:
user_agents table:
user_agent (varchar, and a unique index)
user_requests table:
id
user_agent (varchar, foreign key constraint referencing user_agents table user_agent column)
other_columns etc...
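As rough MySQL DDL, the layout described above could look like this (column sizes and the engine clause are assumptions on my part):

CREATE TABLE user_agents (
  user_agent VARCHAR(255) NOT NULL,
  UNIQUE KEY (user_agent)
) ENGINE=InnoDB;

CREATE TABLE user_requests (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_agent VARCHAR(255) NOT NULL,
  -- other columns...
  PRIMARY KEY (id),
  FOREIGN KEY (user_agent) REFERENCES user_agents (user_agent)
) ENGINE=InnoDB;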
I found that when using the varchar itself as the foreign key, MySQL will optimise the storage on its own, and will only store 1 varchar for each unique user_agent on the disk. Adding 10 million+ user_requests rows adds very little information to the disk.
I also noticed it's even more efficient than using an ID to link the tables like in the original post. MySQL seems to do some magic under the hood and can link the columns with very little info on the disk. It's at least 100x more storage efficient than storing all the strings themselves, and several times more efficient than linking using IDs. You also get all the benefit of foreign keys and cascading. No joins are required to query the columns in either direction, so the queries are very quick as well!
Cheers!
If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
MySQL has no way of predicting that you will always have only 10 unique values. You told it to store a VARCHAR, so it must assume you want to store any string. If it were to use a number to enumerate all possible strings, that number would actually need to be longer than the string itself.
To solve your problem, you can optimize storage by using a numeric ID referencing a lookup table. Since the number of distinct strings in your lookup table is in the hundreds, you need to use at least a SMALLINT (16-bit integer). You don't need to use a numeric as large as INT (32-bit integer).
In the lookup table, declare that id as the primary key. That should make it as quick as possible to do the joins.
If you want to do a join in the reverse direction (querying your 100M row table for a specific user agent), then index the smallint column in your large table. That will take more storage space for the index, so make sure you actually need that type of query before you create the index.
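A sketch of that lookup-table layout (the names and varchar size are placeholders; the SMALLINT id is the point):

CREATE TABLE user_agents (
  id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,   -- a few hundred distinct values fit easily
  user_agent VARCHAR(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (user_agent)
);

CREATE TABLE requests (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_agent_id SMALLINT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  KEY (user_agent_id)     -- only worth the space if you query by user agent
);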
Another suggestion: Get a larger storage volume.
In many tutorials about MySQL they often use an ID which is made automatically when a user has made an account. Later on the ID is used to search for that profile or update that profile.
Question: Why would I use an ID in MySQL when I can search with the username?
I can use the username to search in a MySQL table too, so what are the pros and cons when using an ID?
UPDATE:
Many thanks for your reactions!
So let's say a user wants to log in on a website. He will provide a username and password. But in my code I first have to do a query to find the ID, because the user doesn't know the ID. Is this correct, or is there another way to do it?
What if I store the ID of the user in a cookie, and when the user logs in I first check whether the ID matches the username, and then check whether the password is correct. Then I can use the ID for queries. Is that a good idea? Of course I will use prepared statements for all of this.
Please refer to this post.
1 - It's faster. A JOIN on an integer is much quicker than a JOIN on a string field or combination of fields. It's more efficient to compare integers than strings.
2 - It's simpler. It's much easier to map relations based on a single numeric field than on a combination of other fields of varying data types.
3 - It's data-independent. If you match on the ID you don't need to worry about the relation changing. If you match on a name, what do you do if their name changes (e.g. marriage)? If you match on an address, what if someone moves?
4 - It's more efficient. If you cluster on an (auto incrementing) int field, you reduce fragmentation and reduce the overall size of the data set. This also simplifies indexes needed to cover your relations.
From "an ID which is made automatically" I assume you are talking about an integer column having the attribute AUTO_INCREMENT.
Several reasons a numeric auto-incremented PK is better than a string PK:
A value of type INT is stored on 4 bytes, a string uses 1 to 4 bytes for each character, depending on the charset and the character (plus 1 or 2 extra bytes that store the actual string length for VARCHAR types). Except when your string column contains only 2-3 ASCII characters, an INT always takes less space than a string; this affects the next two entries from this list.
The primary key is an index, and any index is used to speed up the search of rows in the table. The search is done by comparing the searched value with the values stored in the index. Comparing integral numeric values (INT vs. INT) requires a single CPU operation; it works very fast. Comparing string values is harder: the corresponding characters from the two strings are compared taking into account the characteristics of their encoding, collation, upper/lower case etc.; usually more than one pair of characters needs to be compared. This takes a lot of CPU operations and is much slower than comparing INTs.
The InnoDB storage engine keeps a reference to the PK in every index of the table. If the PK of the table is not set or not numeric, InnoDB internally creates a numeric auto-incremented column and uses it instead (and makes the visible PK refer to it too). This means you don't waste any database space by adding an extra ID column (or you don't save anything by not adding it, if you prefer to get it the other way around).
Why does InnoDB work this way? Read the previous item again.
The PK of a table usually migrates as a FK in a related table. This means the value of the PK column of each row from the first table is duplicated into the FK field in the related table (think of the classic example of an employee that works in a department: the id of the department is duplicated into the employee table as department_id). Here the column type affects both the space used and the speed (the FK is usually used in JOIN, WHERE and GROUP BY clauses in queries).
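For illustration, a minimal sketch of that classic example (the table definitions are invented for the purpose):

CREATE TABLE department (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100),
  PRIMARY KEY (id)
);

CREATE TABLE employee (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100),
  department_id INT NOT NULL,     -- the department PK duplicated here as a FK
  PRIMARY KEY (id),
  FOREIGN KEY (department_id) REFERENCES department (id)
);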
Here is one reason, out of many, to do it.
If the username really is the primary key of your relation, using a surrogate key (ID) is at least a space optimization. In the normalization process your relation can be split into several tables. Replacing the username (VARCHAR(30)) with an ID (INT) as the foreign key in the related tables can save a lot of space.
I am using SQL Server 2008 R2.
In a table, I have a column of type VARCHAR(50). I am storing some alpha-numeric data in it. The data is of varying lengths, i.e. it comes from user input.
Now, I have a requirement to store NEWID() - UNIQUEIDENTIFIER type data - in the same column. But I have existing data that I cannot modify. So, instead of converting the datatype of the column from VARCHAR to UNIQUEIDENTIFIER, I am thinking about storing NEWID() in VARCHAR format in the same column.
Is it advisable to do this?
If the column you are speaking of is not to be used as a primary key or index, then you will not have any issues - you can store there whatever you want.
If the column is already an index or even the primary key, you will have the same issues you already have (since, as your post says, the data is of variable length).
If you introduce an index or primary key when making this change, then you should take into accounts the possible performance impacts of using a variable length column for this.
Good advice, if the above is inevitable, would be to attempt to reduce the size of the column to 36 characters (the standard length of the string form of the unique identifiers generated by most systems and frameworks, including the UNIQUEIDENTIFIER type of SQL Server: 32 hexadecimal digits in 5 groups, separated by 4 hyphens).
If your current data can be sized to 36 characters safely, I recommend redefining the column before making it an index or PK, especially if your table has many rows already.
If possible, also redefine it as CHAR(36) (or even NCHAR(36) as advised in the comments) - fixed-length columns perform better as indexes. Also, MS SQL Server 2005 and above support the NEWSEQUENTIALID() function for generating sequential ids; these perform better for index or PK columns.
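If you do end up adding a real GUID column rather than reusing the VARCHAR one, a T-SQL sketch could look like this (table and column names are placeholders):

CREATE TABLE dbo.Items (
  RowGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID(),  -- sequential, index-friendly
  LegacyValue VARCHAR(36) NULL,                                 -- the existing user-entered data
  CONSTRAINT PK_Items PRIMARY KEY (RowGuid)
);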
I am going to store filenames and other details in a table where, I am planning to use sha1 hash of the filename as PK.
Q1. A SHA1 PK will not be a sequentially increasing/decreasing number, so will it be more resource-consuming for the database to maintain and search an index on that key, if I decide to keep it in the database as a 40-char value?
Q2. I read here: https://stackoverflow.com/a/614483/986818 about storing the data as a binary(20) field. Can someone advise me in this regard:
a) Do I have to create this column as: TYPE=integer, LENGTH=20, COLLATION=binary, ATTRIBUTES=binary?
b) How do I convert the SHA1 value in MySQL or Perl to store it in the table?
c) Is there a danger of duplicates for this 20-char value?
UPDATE:
The requirement is to search the table by filename. The user supplies a filename; I search the table and, if the filename is not there, I add it. So I either index the varchar(100) filename field, or generate a column with the SHA1 of the filename, hoping that would be easier for MySQL to index than a varchar field. I can also search using the SHA1 value from my program against the SHA1 column. What do you say: primary key or just an indexed key? I chose a PK because DBIx likes using a PK, and a PK or INDEX+UNIQUE would be the same amount of overhead for the system (so I thought).
OK, then use a very short hash on the filename and accept collisions. Use an integer type for it (that's much faster!). E.g. you can use md5(filename), take the first 8 characters and convert them to an integer. The SQL could look like this:
CREATE TABLE files (
  id INT AUTO_INCREMENT,
  hash INT UNSIGNED,
  filename VARCHAR(100),
  PRIMARY KEY(id),
  INDEX(hash)
);
Then you can use:
SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';
The hash is then used for sorting out most other files (normally all other files) and then the filename is for selecting the right entry out of the few hash collisions.
For generating an integer hash key in Perl I suggest using md5() and pack().
If i decide to keep it in database as 40 char value.
Using a character sequence as a key will degrade performance, because comparing strings is more expensive than comparing integers and the index entries are larger.
Also, the PK is supposed to be unique. Although it is probably unlikely that you will end up with collisions, theoretically using a hash function to create the PK seems inappropriate.
Additionally, anyone knowing the filename and the hash you use would know all your database ids. I am not sure whether that is something you can ignore.
Q1: Yes, it will need to build up a B-tree of nodes that contain not just one integer (4 bytes) but a CHAR(40). Speed would be approximately the same, as long as the index is kept in memory. As the entries are about 10 times bigger, you need 10 times more memory to keep it in memory. BUT: you probably want to look up by the hash anyway, so you'll need to have it either as primary key OR as an index.
Q2: Just create a table column like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ...) VALUES (UNHEX('4D7953514C'), ...);
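To answer part b) with MySQL alone, a sketch assuming the table also has a filename column (MySQL's SHA1() returns the 40-character hex string, and UNHEX() packs it into 20 bytes):

INSERT INTO test (ID, filename) VALUES (UNHEX(SHA1('some_file.txt')), 'some_file.txt');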
-- Regarding: is there a danger of duplicates for this 20 char value?
The chance is 1 in 2^(8*20) = 2^160, i.e. about 1 in 1.46 * 10^48, so a collision is extremely improbable.
There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
The hash is NOT a 40-char value! It's a 160-bit number, and you should store it that way (as a 20-byte binary field). Edit: I see you mentioned that in comment 2. Yes, you should definitely do that. But I can't tell you how, since I don't know what programming language you are using. Edit 2: I see it's Perl - sorry, I don't know how to convert it in Perl, but look for "pack" functions.
No, do not create it as type integer. MySQL's largest integer type (BIGINT) is 64 bits, which doesn't hold the entire thing. Although you could really just truncate it to 64 bits without real harm.
It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.
I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).
If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same filename/path. Both versions will have different hashes/pass the uniqueness constraint but will actually both refer to the same file. This depends on operating system and may or may not be a problem for you. Just something to keep in mind.
I am developing a logging database, the ids of the components being logged in this case are not determined by the database itself, but by the system that sends the report. The system id is a unique varchar, and the component's id is determined by the system (in some faraway location), so uniqueness is guaranteed when the component's primary key is system_id + component_id.
What I'm wondering is if this approach is going to be efficient. I could use auto incremented integers as the id, but that would mean I would have to do select operations before inserting so that I can get this generated id instead of using the already known string id that the system provides.
The database is going to be small scale, no more than a few dozen systems, each with a few dozen components, and maybe some thousands of component updates (another table). Old updates will be periodically dumped into a file and removed from the database, so it won't ever get "big."
Any recommendations?
I would lean towards auto incremented integers as a primary key and put indexes on system_id and component_id. Your selects before that insert will be very cheap and fast.
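A rough sketch of that layout (the column types and sizes are assumptions):

CREATE TABLE components (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate key used by the rest of the schema
  system_id VARCHAR(64) NOT NULL,            -- the id the reporting system sends
  component_id VARCHAR(64) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (system_id, component_id)       -- also makes the "select before insert" a cheap lookup
);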
I'm sure you'll find that tables of several million rows will perform fine with varchar() keys.
It's easy enough to test. Just import your data.