Question about how foreign key data is stored in SQL - mysql

I know this is ultra-basic, but it's an assumption I've always held and would like to validate that it's true (in general, with details specific to the various implementations).
Let's say I have a table that has a text column "Fruit". In that column only one of four values ever appears: Pear, Apple, Banana, and Strawberry. I have a million rows.
Instead of repeating that data (on average) a quarter million times each, if I extract it into another table that has a Fruit column and just those four rows, and then make the original column a foreign key, does it save space?
I assume that the four fruit names are stored only once, and that the million rows now have pointers or indexes or some kind of reference into the second table.
If my row values are longer than short fruit names I assume the savings/optimization is even larger.

The data types of the fields on both sides of a foreign key relationship have to be compatible (in practice, identical).
If the parent table's key field is (say) varchar(20), then the foreign key field in the dependent table will also have to be varchar(20). Which means, yes, you'd have X million rows of 'Apple' and 'Pear' and 'Banana' repeating in each table which has a foreign key pointing back at the fruit table.
Generally it's more efficient to use numeric fields as keys (int, bigint), as those can be compared with very few CPU instructions (often a single direct comparison). Strings, on the other hand, require loops and comparatively expensive setup. So yes, you'd be better off storing the fruit names in a table somewhere and using their associated numeric ID fields as the foreign key.
Of course, you should benchmark both setups. These are just general rules of thumb, and your specific requirements/setup may actually work faster with the strings-as-key version.

That is correct.
You should have
table fruits
id name
1 Pear
2 Apple
3 Banana
4 Strawberry
where id is the primary key.
In your second table you will use just the id from this table. That will save you physical space and will make your select statements faster.
Besides, this structure makes it very easy to add new fruits.
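To make that concrete, here is a minimal sketch of the two-table design. It uses SQLite through Python's standard sqlite3 module purely because it is easy to run anywhere; the MySQL DDL is nearly identical (INT AUTO_INCREMENT in place of SQLite's INTEGER PRIMARY KEY, plus ENGINE=InnoDB to get enforced foreign keys). The table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs

conn.execute("""
    CREATE TABLE fruits (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE inventory (
        id       INTEGER PRIMARY KEY,
        fruit_id INTEGER NOT NULL REFERENCES fruits(id)
    )
""")

conn.executemany("INSERT INTO fruits (id, name) VALUES (?, ?)",
                 [(1, "Pear"), (2, "Apple"), (3, "Banana"), (4, "Strawberry")])

# The big table stores only a small integer per row, not the string.
conn.executemany("INSERT INTO inventory (fruit_id) VALUES (?)",
                 [(2,)] * 5 + [(3,)] * 3)

# A join recovers the names whenever they are needed.
rows = conn.execute("""
    SELECT f.name, COUNT(*)
    FROM inventory i JOIN fruits f ON f.id = i.fruit_id
    GROUP BY f.name ORDER BY f.name
""").fetchall()
print(rows)  # [('Apple', 5), ('Banana', 3)]
```

Each fruit name lives on disk exactly once, and the million-row table carries only the small integer reference.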

Instead of repeating that data (on average) a quarter million times
each, if I extract it into a another table that has a Fruit column and
just those four rows, and then make the original column a foreign key,
does it save space?
No: if "Fruit" is the PRIMARY KEY of the "lookup" table, then it must also be stored as the FOREIGN KEY in the "large" table, so nothing is saved.
However, if you make a small surrogate PRIMARY KEY (such as an integer "id") in the "lookup" table and then use that as the FOREIGN KEY in the "large" table, you'll save space.

First, yes, it will save space: an INT is 4 bytes and a TINYINT is only 1 byte. Second, searching on an INT field will be faster than on a VARCHAR. Additionally, you can use ENUM if your set of values won't change in the future. With ENUM you will get the same (maybe faster) result as with a secondary table, and you avoid the extra join.

Normalization is not just about space. It's also about reducing redundancy, modelling the data's behavior, updating just one row for a change, and reducing the scope of locks by touching only the minimal amount of data.

Sadly, you assume wrong: the values are physically stored repeatedly in each referencing table. Some SQL products do store the value just once, but most don't, notably the more popular ones based on contiguous storage on disk.
This is the reason end users feel the need to implement their own pointers in the guise of integer 'surrogate keys'. A system-managed surrogate would be preferable, e.g. one that isn't visible to users, in the same way an index's values are maintained by the system and cannot be manipulated directly by users. The problem with rolling your own is that they become part of the logical model.

Related

Optimising Storage Space: Many rows & columns with the same values

I have multiple tables which each store 100 million+ rows of data. There are only a few possible unique values for any given column, so many of the columns have duplicate values.
When I initially designed the schema I decided to use secondary linked tables to store the actual values, in order to optimise the storage space required for the database.
For example:
Instead of a table for storing user agents like this:
id (int)
user_agent (varchar)
I am using 2 tables like this:
Table 1
id (int)
user_agent_id (int)
Table 2
id (int)
user_agent (varchar)
When there are 100 million+ rows I found this schema saves a massive amount of storage space because there are only a few hundred possible user agents and those strings make up the majority of the data.
The issue I am running in to is:
Using linked tables to store so much of the string data across many different tables is adding overhead on the development side and making querying the data much slower since joins are required.
My question is:
Is there a way I can put all of the columns in a single table, and force MySQL not to duplicate the storage required for columns with duplicate values? I'm beginning to think there must be some built-in way to handle this type of situation, but I have not found anything in my research.
If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
Thank you!
After some digging and testing I found what seems to be the best solution: creating an index and foreign key constraint using the varchar column itself, rather than using an ID field.
InnoDB supports foreign keys on varchar as well as int columns: https://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html
Here is an example:
user_agents table:
user_agent (varchar, and a unique index)
user_requests table:
id
user_agent (varchar, foreign key constraint referencing user_agents table user_agent column)
other_columns etc...
I found that when using the varchar itself as the foreign key, MySQL seemed to optimise the storage on its own, and only stored one varchar for each unique user_agent on disk. Adding 10 million+ user_requests rows added very little data on disk.
I also noticed it's even more efficient than using an ID to link the tables as in the original post. MySQL seems to do some magic under the hood and can link the columns with very little information on disk. It's at least 100x more storage-efficient than storing all the strings themselves, and several times more efficient than linking using IDs. You also get all the benefits of foreign keys and cascading. No joins are required to query the columns in either direction, so the queries are very quick as well!
Cheers!
If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
MySQL has no way of predicting that you will always have only 10 unique values. You told it to store a VARCHAR, so it must assume you want to store any string. If it were to use a number to enumerate all possible strings, that number would actually need to be longer than the string itself.
To solve your problem, you can optimize storage by using a numeric ID referencing a lookup table. Since the number of distinct strings in your lookup table is in the hundreds, you need to use at least a SMALLINT (16-bit integer). You don't need to use a numeric as large as INT (32-bit integer).
In the lookup table, declare that id as the primary key. That should make it as quick as possible to do the joins.
If you want to do a join in the reverse direction, querying your 100M-row table for a specific user agent, then index the smallint column in your large table. That index will take more storage space, so make sure you need that type of query on each table before you create it.
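Whether that index is actually picked up for the reverse lookup can be checked with the query planner. This sketch uses SQLite via Python's sqlite3 for convenience (in MySQL the equivalent check is EXPLAIN SELECT ...); the table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_agents (id INTEGER PRIMARY KEY, user_agent TEXT UNIQUE)")
conn.execute("CREATE TABLE user_requests (id INTEGER PRIMARY KEY, user_agent_id INTEGER)")

# Index the FK column so "all requests for agent X" doesn't scan the big table.
conn.execute("CREATE INDEX idx_requests_agent ON user_requests(user_agent_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM user_requests WHERE user_agent_id = ?", (1,)
).fetchall()
# The plan's detail column should mention the index rather than a table scan.
uses_index = any("idx_requests_agent" in row[3] for row in plan)
print(uses_index)  # True
```

Drop the CREATE INDEX line and the same query plan degrades to a full scan of user_requests.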
Another suggestion: Get a larger storage volume.

Index every column to add foreign keys

I am currently learning about foreign keys and trying to add them wherever I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL is creating an index automatically for every foreign key, so essentially I'll have an index on every column? Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
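The staging-table tip can be sketched like this, using SQLite via Python's sqlite3 for illustration (in MySQL the INSERT OR IGNORE would be INSERT IGNORE; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (raw_agent TEXT);                                -- raw, not yet normalized
    CREATE TABLE user_agents (id INTEGER PRIMARY KEY, user_agent TEXT UNIQUE);
    CREATE TABLE clicks (id INTEGER PRIMARY KEY, user_agent_id INTEGER);  -- the 'real' table
""")
conn.executemany("INSERT INTO staging VALUES (?)",
                 [("Mozilla/5.0",), ("curl/8.0",), ("Mozilla/5.0",)])

# One bulk pass adds any new normalization keys (MySQL: INSERT IGNORE)...
conn.execute("""
    INSERT OR IGNORE INTO user_agents (user_agent)
    SELECT DISTINCT raw_agent FROM staging
""")
# ...then the staged rows move into the 'real' table with their _id's resolved.
conn.execute("""
    INSERT INTO clicks (user_agent_id)
    SELECT ua.id FROM staging s JOIN user_agents ua ON ua.user_agent = s.raw_agent
""")
conn.execute("DELETE FROM staging")

n_clicks = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
n_agents = conn.execute("SELECT COUNT(*) FROM user_agents").fetchone()[0]
print(n_clicks, n_agents)  # 3 2
```

Two bulk statements replace the twelve per-row lookups, and reads on the real table are only disturbed during the final move.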
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider whether every data item (field) in your clicks table needs a lookup table. For example, the ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table; you do not really need a lookup table, since IP addresses have a wide range of values and are largely unique.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique) and then tweak from there.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com

Why would I use ID in MySQL when I can search with the username?

In many tutorials about MySQL, an ID is created automatically when a user makes an account. Later on, the ID is used to look up that profile or to update it.
Question: Why would I use ID in MySQL when I can search with the username?
I can use the username to search in a MySQL table too, so what are the pros and cons when using an ID?
UPDATE:
Many thanks for your reactions!
So let's say a user wants to log in on a website. He will provide a username and password. But in my code I first have to do a query to find the ID, because the user doesn't know the ID. Is this correct, or is there another way to do it?
If I were to store the user's ID in a cookie, then when the user logs in I would first check whether the ID matches the username, and then check whether the password is correct. After that I could use the ID for queries. Is that a good idea? Of course I will use prepared statements for all of this.
Please refer to this post.
1 - It's faster. A JOIN on an integer is much quicker than a JOIN on a string field or combination of fields. It's more efficient to compare integers than strings.
2 - It's simpler. It's much easier to map relations based on a single numeric field than on a combination of other fields of varying data types.
3 - It's data-independent. If you match on the ID you don't need to worry about the relation changing. If you match on a name, what do you do if their name changes (i.e. marriage)? If you match on an address, what if someone moves?
4 - It's more efficient. If you cluster on an (auto-incrementing) int field, you reduce fragmentation and the overall size of the data set. This also simplifies the indexes needed to cover your relations.
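Point 3 is the one that bites most often in practice. A quick sketch (SQLite via Python's sqlite3; the schema is illustrative) of why matching on the ID survives a rename:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        body TEXT);
""")
conn.execute("INSERT INTO users (id, username) VALUES (1, 'alice')")
conn.executemany("INSERT INTO posts (user_id, body) VALUES (1, ?)",
                 [("hi",), ("bye",)])

# The user changes her name: one UPDATE, zero posts touched,
# and every existing post still joins to the right account.
conn.execute("UPDATE users SET username = 'alice_smith' WHERE id = 1")

rows = conn.execute("""
    SELECT u.username, COUNT(*)
    FROM posts p JOIN users u ON u.id = p.user_id
    GROUP BY u.username
""").fetchall()
print(rows)  # [('alice_smith', 2)]
```

Had posts referenced the username instead, the rename would have required updating every post row (or a cascading FK) to keep the relationship intact.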
From "an ID which is made automatically" I assume you are talking about an integer column having the attribute AUTO_INCREMENT.
Several reasons a numeric auto-incremented PK is better than a string PK:
A value of type INT is stored on 4 bytes, a string uses 1 to 4 bytes for each character, depending on the charset and the character (plus 1 or 2 extra bytes that store the actual string length for VARCHAR types). Except when your string column contains only 2-3 ASCII characters, an INT always takes less space than a string; this affects the next two entries from this list.
The primary key is an index, and any index is used to speed up the search for rows in the table. The search is done by comparing the searched value with the values stored in the index. Comparing integral numeric values (INT vs. INT) requires a single CPU operation, so it works very fast. Comparing string values is harder: the corresponding characters from the two strings are compared taking into account their encoding, collation, upper/lower case, etc.; usually more than one pair of characters needs to be compared. This takes many CPU operations and is much slower than comparing INTs.
The InnoDB storage engine keeps a reference to the PK in every index of the table. If the PK of the table is not set or not numeric, InnoDB internally creates a numeric auto-incremented column and uses it instead (and makes the visible PK refer to it too). This means you don't waste any database space by adding an extra ID column (or you don't save anything by not adding it, if you prefer to get it the other way around).
Why does InnoDB work this way? Read the previous item again.
The PK of a table usually migrates as a FK in a related table. This means the value of the PK column of each row from the first table is duplicated into the FK field of the related table (think of the classic example of an employee who works in a department: the department's id is duplicated as the department_id column in the employee table). Here the column type affects both the space used and the speed (the FK is usually used in JOIN, WHERE and GROUP BY clauses).
Here is one reason among many to do it.
If the username really is the primary key for your relation, using a surrogate key (ID) is at least a space optimization. In the normalization process your relation may be split into several tables. Replacing the username (varchar(30)) with an ID (int) as the foreign key in the related tables can save a lot of space.

Table without a primary key

So I've always been told that it's absolutely necessary to have a primary key specified with a table. I've been doing some work and ran into a situation where a primary key's unique constraint would stop data I need from being added.
If there's an example situation where a table was structured with fields:
Age, First Name, Last Name, Country, Race, Gender
Where if a TON of data was being entered all these fields don't necessarily uniquely identify a row and I don't need an index across all columns anyways. Would the only solution here be to make an auto-incrementing ID field? Would it be okay to NOT have a primary at all?
It's not always necessary to have a primary key; most DBMSs will allow you to construct a table without one (a).
But that doesn't necessarily mean it's a good idea. Have a think about the situation in which you want to use that data. Now think about if you have two twenty-year-old Australian men named Bob Smith, both from Perth.
Without a unique constraint, you can put both rows into the table, but here's the rub: how would you figure out which one you want to use in future? (b)
Now, if you just want to store the fact that there are one or more people meeting those criteria, you only need to store one row. But then, you'd probably have a composite primary key consisting of all columns.
If you have other information you want to store about the person (e.g., highest score in the "2048" game on their iPhone), then you don't want a primary key across the entire row, just across the columns you mention.
Unfortunately, that means there will undoubtedly come a time when both of those Bob Smiths try to write their high scores to the database, only for one of them to lose his information.
If you want them both in the table and still want to allow for the possibility outlined above (two people with identical attributes in the columns you mention) then the best bet is to introduce an artificial key such as an auto-incrementing column, for the primary key. That will allow you to uniquely identify a row regardless of how identical the other columns are.
The other advantage of an artificial key is that, being arbitrary, it never needs to change for the thing being identified. In your example, if you use age, names, nationality or location (c) in your primary key, these are all subject to change, meaning that you will need to adjust any foreign keys referencing those rows. If the tables referencing these rows uses the unchanging artificial key, that will never be a problem.
(a) There are situations where a primary key doesn't really give you any performance benefit such as when the table is particularly small (such as mapping integers 1 through 12 to month name).
In other words, things where a full table scan isn't really any slower than indexing. But these situations are incredibly rare and I'd probably still use a key because it's more consistent (especially since the use of a key tends not to make a difference to the performance either way).
(b) Keep in mind that we're talking in terms of practice here rather than theory. While in practice you may create a table with no primary key, relational theory states that each row must be uniquely identifiable, otherwise relations are impossible to maintain.
C.J. Date who, along with Codd, is one of the progenitors of relational database theory, states the rules of relational tables in "An introduction to Database Systems", one of which is:
The records have a unique identifier field or field combination called the primary key.
So, in terms of relational theory, each table must have a primary key, even though it's not always required in practice.
(c) Particularly age which is guaranteed to change annually until you're dead, so perhaps date of birth may be a better choice for that column.
Would the only solution here be to make an auto-incrementing ID field?
That is a valid way, but it is not the only one: you could use other ways to generate unique keys, such as using GUIDs. Keys like that are called surrogate primary keys, because they are not related to the "payload" of the data row.
Would it be okay to NOT have a primary at all?
Since you mentioned that the actual data in rows may not be unique, you wouldn't be able to use your table effectively without a primary key. For example, you would not be able to update or delete a specific row, which may be required, for example, when a user's name changes.
The most simple solution would be to include an ID column to serve as primary key:
id int not null primary key auto_increment
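A small sketch of why the surrogate id helps here (SQLite via Python's sqlite3; SQLite's INTEGER PRIMARY KEY AUTOINCREMENT plays the role of MySQL's AUTO_INCREMENT, and the columns follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE people (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT, last_name TEXT,
        age        INTEGER, country TEXT
    )
""")
# Two otherwise-identical Bob Smiths insert without any conflict...
for _ in range(2):
    conn.execute("""
        INSERT INTO people (first_name, last_name, age, country)
        VALUES ('Bob', 'Smith', 20, 'Australia')
    """)
# ...and each row can still be targeted individually via its id.
conn.execute("UPDATE people SET age = 21 WHERE id = 2")
ages = [r[0] for r in conn.execute("SELECT age FROM people ORDER BY id")]
print(ages)  # [20, 21]
```

Without the id column there would be no way to update one Bob Smith and not the other.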
From your post it looks like the table represents a person entity. In that case, a PK would identify each person record uniquely, so I would suggest having a primary key on the table.
You can either create an AUTO_INCREMENT ID column (a synthetic ID column)
(OR)
You can combine multiple columns in your table which together uniquely determine all the other fields, like (First Name, Last Name), which will make it a composite primary key. But that may clash as well, since more than one person could have the same full name (first name + last name).
Typically you should avoid proliferating ID primary keys fields through your database.
Now, that doesn't mean you shouldn't have primary keys, your primary key can be a surrogate or a composed key. And that's what you should do here.
If those fields {Age, First Name, Last Name, Country, Race, Gender}, identify unequivocally each row, then make a primary key composed by all of those fields.
But if not, then you must have some other type of information to disambiguate your data.
You can also choose not to specify any kind of key, and treat that table as a non-normalized, redundant data source... if that is what you need!
Use an identity column with another column such as Last Name

What are the merits of using numeric row IDs in MySQL?

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes per record devoted to something that may not seem useful. However, worrying about such efficiencies is premature optimization and probably not worth the effort.
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer big time if you apply functions to columns in a search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
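You can watch the planner fall back to a scan when a function wraps the column. A sketch using SQLite via Python's sqlite3 (MySQL's EXPLAIN shows the same effect; the table name and 'abc' are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (colm TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_colm ON t(colm)")

# Wrapping the column in a function hides it from the index: full scan.
fn_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE SUBSTR(colm, 1, 3) = 'abc'"
).fetchall()
# A predicate on the bare column can use the index: indexed search.
eq_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE colm = 'abc'"
).fetchall()

print(fn_plan[0][3])  # SCAN ...
print(eq_plan[0][3])  # SEARCH ... INDEX idx_colm ...
```

The same predicate rewritten as a range (e.g. a prefix match on the bare column) gets the index back.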
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent varchar(20) pk fk to parent table
birthorder int pk
name varchar(20)
Then, you could have rows like...
parent birthorder name
homer 1 bart
homer 2 lisa
homer 3 maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision, because (homer, 1) is already occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
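The collision is easy to reproduce (SQLite via Python's sqlite3; the same composite-key constraint behaves identically in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE child (
        parent     TEXT,
        birthorder INTEGER,
        name       TEXT,
        PRIMARY KEY (parent, birthorder)
    )
""")
conn.executemany("INSERT INTO child VALUES (?, ?, ?)",
                 [("homer", 1, "bart"),
                  ("homer", 2, "lisa"),
                  ("homer", 3, "maggie")])
try:
    # (homer, 1) is already taken, so this violates the composite PK.
    conn.execute("INSERT INTO child VALUES ('homer', 1, 'badbart')")
    collided = False
except sqlite3.IntegrityError:
    collided = True
print(collided)  # True
```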
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
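That sorting trap is worth a short demonstration (Python's sqlite3; the values are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v (s TEXT, n INTEGER)")
conn.executemany("INSERT INTO v VALUES (?, ?)",
                 [("2", 2), ("101", 101), ("30", 30)])

# Stored as text, the values sort character by character...
as_text = [r[0] for r in conn.execute("SELECT s FROM v ORDER BY s")]
# ...stored as numbers, they sort by magnitude.
as_num = [r[0] for r in conn.execute("SELECT n FROM v ORDER BY n")]

print(as_text)  # ['101', '2', '30'] -- '2' really does come after '101'
print(as_num)   # [2, 30, 101]
```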
The main benefit you get from numeric values is that they are easier to index. Indexing is a process MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow, with a worst case of O(n). If instead your data is in a nice, searchable format, like a binary search tree, it can be found in O(log n), which is much faster.
Indexing is the process MySQL uses to prepare data to be searched: it generates search trees and other clever structures that make finding data quick, and it makes many searches much faster. However, to do this it has to compare the value you are searching for to various 'key' values to determine whether your value is greater than or less than the key.
This comparison can be done on non-numeric values, but comparing non-numeric values is much slower. If you want to be able to look up data quickly, your best bet is to have an integer 'key'.
Numeric row ids have many advantages over string-based ids.
Most of them are mentioned in other answers:
1. Indexing. Primary keys are indexed by default in a relational database, and a numeric key is more efficient to index.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row id often becomes a foreign key, and numeric ids are compact to store, making them efficient.
5. Using auto-increment on the primary key has its own advantages too.