Optimising Storage Space: Many rows & columns with the same values - mysql

I have multiple tables which each store 100 million+ rows of data. There are only a few possible unique values for any given column, so many of the columns contain heavily duplicated values.
When I initially designed the schema I decided to use secondary linked tables to store the actual values, in order to optimise the storage space required for the database.
For example:
Instead of a table for storing user agents like this:
id (int)
user_agent (varchar)
I am using 2 tables like this:
Table 1
id (int)
user_agent_id (int)
Table 2
id (int)
user_agent (varchar)
When there are 100 million+ rows I found this schema saves a massive amount of storage space because there are only a few hundred possible user agents and those strings make up the majority of the data.
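For reference, a rough SQL sketch of the two layouts described above; the table names, column types and sizes are just illustrative assumptions:

-- Single-table layout (stores the full string on every row):
CREATE TABLE requests_flat (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent VARCHAR(255) NOT NULL
);

-- Linked-table layout (stores only an integer reference per row):
CREATE TABLE user_agents (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent VARCHAR(255) NOT NULL
);

CREATE TABLE requests (
  id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent_id INT UNSIGNED NOT NULL
);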
The issue I am running into is:
Using linked tables to store so much of the string data across many different tables is adding overhead on the development side and making querying the data much slower since joins are required.
My question is:
Is there a way I can put all of the columns in a single table, and force MySQL to not duplicate the storage required for columns with duplicate values? I'm beginning to think there must be some built-in way to handle this type of situation, but I have not found anything in my research.
If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
Thank you!

After some digging and testing I found what seems to be the best solution: creating an index and foreign key constraint using the varchar column itself, rather than using an ID field.
InnoDB supports foreign keys on varchar columns as well as int: https://dev.mysql.com/doc/refman/5.6/en/create-table-foreign-keys.html
Here is an example:
user_agents table:
user_agent (varchar, and a unique index)
user_requests table:
id
user_agent (varchar, foreign key constraint referencing user_agents table user_agent column)
other_columns etc...
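A rough SQL sketch of that schema, assuming InnoDB and an illustrative VARCHAR length:

CREATE TABLE user_agents (
  user_agent VARCHAR(255) NOT NULL,
  UNIQUE KEY (user_agent)
) ENGINE=InnoDB;

CREATE TABLE user_requests (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent VARCHAR(255) NOT NULL,
  FOREIGN KEY (user_agent) REFERENCES user_agents (user_agent)
  -- other columns etc...
) ENGINE=InnoDB;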
I found that when using the varchar itself as the foreign key, MySQL will optimise the storage on its own and will only store one varchar for each unique user_agent on disk. Adding 10 million+ user_requests rows adds very little data to the disk.
I also noticed it's even more efficient than using an ID to link the tables, as in the original post. MySQL seems to do some magic under the hood and can link the columns with very little data on disk. It's at least 100x more storage-efficient than storing all the strings themselves, and several times more efficient than linking using IDs. You also get all the benefits of foreign keys and cascading. No joins are required to query the columns in either direction, so the queries are very quick as well!
Cheers!

If I have 10 unique values for a column and 100 million+ rows why would MySQL save every value including the duplicates fully in storage rather than just a reference to the unique values?
MySQL has no way of predicting that you will always have only 10 unique values. You told it to store a VARCHAR, so it must assume you want to store any string. If it were to use a number to enumerate all possible strings, that number would actually need to be longer than the string itself.
To solve your problem, you can optimize storage by using a numeric ID referencing a lookup table. Since the number of distinct strings in your lookup table is in the hundreds, you need at least a SMALLINT (16-bit integer); you don't need a numeric type as large as INT (32-bit integer).
In the lookup table, declare that id as the primary key. That should make it as quick as possible to do the joins.
If you want to do the join in the reverse direction, i.e. query your 100M-row table for a specific user agent, then index the smallint column in your large table. That index will take extra storage space, so make sure you actually need that type of query on each table before you create it.
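A rough sketch of that lookup-table layout under the stated assumptions (the table and column names are illustrative, not taken from your schema):

CREATE TABLE user_agents (
  id         SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent VARCHAR(255) NOT NULL
);

CREATE TABLE user_requests (
  id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_agent_id SMALLINT UNSIGNED NOT NULL
  -- other columns...
);

-- Only add this secondary index if you need to filter the big table by user agent:
CREATE INDEX idx_user_agent_id ON user_requests (user_agent_id);

-- Typical join back to the string value:
SELECT r.id, a.user_agent
FROM user_requests AS r
JOIN user_agents AS a ON a.id = r.user_agent_id;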
Another suggestion: Get a larger storage volume.

Related

MySQL query and insertion optimisation with varchar(255) UUIDs

I think this question has been asked in some way, shape or form, but I couldn't find a question that asked exactly what I wish to understand, so I thought I'd put the question here.
Problem statement
I have built a web application with a MySQL database of, say, customer records with an INT(11) id PK AI field and a VARCHAR(255) uuid field. The uuid field is not indexed nor set as unique. The uuid field is used as a public identifier, so it's part of URLs etc. - e.g. https://web.com/get_customer/[uuid]. This was done because the UUID is 'harder' to guess for a regular John Doe - but understand that it is certainly not 'unguessable' in theory. The issue now is that as the database grows larger, I have observed that the query to retrieve a particular customer record is taking longer to complete.
My thoughts on how to solve the issue
The solution that comes to mind is to make the uuid field unique and index it. But I've been doing some reading on this, and various blog posts and StackOverflow answers describe putting indices on UUIDs as really bad for performance. I also read that it will increase the time it takes to insert a new customer record into the database, as MySQL will take time to find the correct location in which to place the record within the index.
The above mentioned https://web.com/get_customer/[uuid] can be accessed without having to authenticate, which is why I'm not using the id field for the same. It is possible for me to consider moving to integer-based UUIDs (I don't need the UUIDs to be universally unique - they just need to be unique for that particular table) - will that improve the indexing performance, and in turn the insertion and querying performance?
Is there a good blog post or information page on how to best set up a database for such a requirement? I need the ability to store a customer record which is 'hard' to guess, easy to insert and easy to query in a large data set.
Any assistance is most appreciated. Thank you!
The received wisdom you mention about putting indexes on UUIDs only comes up when you use them in place of autoincrementing primary keys. Why? The entire table (InnoDB) is built behind the primary key as a clustered index, and bulk loading works best when the index values are sequential.
You certainly can put an ordinary index on your UUID column. If you want your INSERT operations to fail in the astronomically unlikely event that you get a random duplicate UUID value, you can use an index like this:
ALTER TABLE customer ADD UNIQUE INDEX uuid_constraint (uuid);
But duplicate UUIDv4s are very rare indeed. They have 122 random bits, and most software generating them these days uses cryptographic-quality random number generators. Omitting the UNIQUE index is, I believe, an acceptable risk. (Don't use UUIDv1, 2, 3, or 5: they're not hard enough to guess to keep your data secure.)
If your UUID index isn't unique, you save time on INSERTs and UPDATEs: they don't need to look at the index to detect uniqueness constraint violations.
Edit. When UUID data is in a UNIQUE index, INSERTs are more costly than they are in a similar non-unique index. Should you use a UNIQUE index? Not if you have a high volume of INSERTs. If you have a low volume of INSERTs it's fine to use UNIQUE.
This is the index to use if you omit UNIQUE:
ALTER TABLE customer ADD INDEX uuid (uuid);
To make lookups very fast you can use covering indexes. If your most common lookup query is, for example,
SELECT uuid, givenname, surname, email
FROM customer
WHERE uuid = :uuid
you can create this so-called covering index.
ALTER TABLE customer
ADD INDEX uuid_covering (uuid, givenname, surname, email);
Then your query will be satisfied directly from the index and therefore be faster.
There's always an extra cost to INSERT and UPDATE operations when you have more indexes. But the cost of a full table scan for a query is, in a large table, far far greater than the extra INSERT or UPDATE cost. That's doubly true if you do a lot of queries.
In computer science there's often a space / time tradeoff. SQL indexes use space to save time. It's generally considered a good tradeoff.
(There's all sorts of trickery available to you by using composite primary keys to speed things up. But that's a topic for when you have gigarows.)
(You can also save index and table space by storing UUIDs in BINARY(16) columns and using the UUID_TO_BIN() and BIN_TO_UUID() functions to convert them.)
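As a rough sketch of that BINARY(16) approach (UUID_TO_BIN() and BIN_TO_UUID() are available from MySQL 8.0; the table and column layout here is only illustrative):

CREATE TABLE customer (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  uuid      BINARY(16) NOT NULL,
  givenname VARCHAR(100),
  surname   VARCHAR(100),
  email     VARCHAR(255),
  INDEX uuid_idx (uuid)
);

-- Convert to and from the textual form at the boundaries:
INSERT INTO customer (uuid, givenname, surname, email)
VALUES (UUID_TO_BIN(UUID()), 'Jane', 'Doe', 'jane@example.com');

SELECT BIN_TO_UUID(uuid) AS uuid, givenname, surname, email
FROM customer
WHERE uuid = UUID_TO_BIN('3f06af63-a93c-11e4-9797-00505690773f');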

Storing a 100k by 100k array in MySQL

I need to store a massive, fixed size square array in MySQL. The values of the array are just INTs but they need to be accessed and modified fairly quickly.
So here is what I am thinking:
Just use 1 column for primary keys and translate the 2d arrays indexes into single dimensional indexes.
So if the 2d array is n by n => 2dArray[i][j] = 1dArray[n*(i-1)+j]
This translates the problem into storing a massive 1D array in the database.
Then use another column for the values.
Make every entry in the array a row.
However, I'm not very familiar with the internal workings of MySQL.
100k*100k makes 10 billion data points, which is more than what 32 bits can get you so I can't use INT as a primary key. And researching stackoverflow, some people have experienced performance issues with using BIGINT as primary key.
In this case where I'm only storing INTs, would the performance of MySQL drop as the number of rows increases?
Or if I were to scatter the data over multiple tables on the same server, could that improve performance? Right now, it looks like I won't have access to multiple machines, so I can't really cluster the data.
I'm completely flexible about every idea I've listed above and open to suggestions (except not using MySQL because I've kind of committed to that!)
As for your concern that BIGINT or adding more rows decreases performance, of course that's true. You will have 10 billion rows, that's going to require a big table and a lot of RAM. It will take some attention to the queries you need to run against this dataset to decide on the best storage method.
I would probably recommend using two columns for the primary key. Developers often overlook the possibility of a compound primary key.
Then you can use INT for both primary key columns if you want to.
CREATE TABLE MyTable (
  array_index1 INT NOT NULL,
  array_index2 INT NOT NULL,
  datum INT NOT NULL, -- the question says the values are INTs; substitute whatever type fits
  PRIMARY KEY (array_index1, array_index2)
);
Note that a compound index like this means that if you search on the second column without an equality condition on the first column, the search won't use the index. So you need a secondary index if you want to support that.
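For example, a sketch of such a secondary index (the index name is an illustrative choice):

ALTER TABLE MyTable ADD INDEX idx_array_index2 (array_index2);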
MySQL does not support 100,000 columns. It has limits of 4096 columns per table and 65,535 bytes per row (not counting BLOB/TEXT columns).
Storing the data in multiple tables is possible, but will probably make your queries terribly awkward.
You could also look into using table PARTITIONING, but this is not as useful as it sounds.

Why would I use ID in MySQL when I can search with the username?

In many tutorials about MySQL they often use an ID which is generated automatically when a user creates an account. Later on the ID is used to look up that profile or update that profile.
Question: Why would I use ID in MySQL when I can search with the username?
I can use the username to search in a MySQL table too, so what are the pros and cons when using an ID?
UPDATE:
Many thanks for your reactions!
So let's say a user wants to log in on a website. He will provide a username and password. But for my code I first have to do a query to find the ID, because the user doesn't know the ID. Is this correct, or is there another way to do it?
What if I store the ID of the user in a cookie, and when the user logs in I first check whether the ID belongs to that username, and then check whether the password is correct? Then I can use the ID for queries. Is that a good idea? Of course I will use prepared statements for all of this.
Please refer to this post.
1 - It's faster. A JOIN on an integer is much quicker than a JOIN on a string field or combination of fields. It's more efficient to compare integers than strings.
2 - It's simpler. It's much easier to map relations based on a single numeric field than on a combination of other fields of varying data types.
3 - It's data-independent. If you match on the ID you don't need to worry about the relation changing. If you match on a name, what do you do if their name changes (e.g. marriage)? If you match on an address, what if someone moves?
4 - It's more efficient. If you cluster on an (auto-incrementing) int field, you reduce fragmentation and reduce the overall size of the data set. This also simplifies the indexes needed to cover your relations.
From "an ID which is made automatically" I assume you are talking about an integer column having the attribute AUTO_INCREMENT.
Several reasons a numeric auto-incremented PK is better than a string PK:
A value of type INT is stored in 4 bytes; a string uses 1 to 4 bytes for each character, depending on the charset and the character (plus 1 or 2 extra bytes that store the actual string length for VARCHAR types). Except when your string column contains only 2-3 ASCII characters, an INT always takes less space than a string; this affects the next two entries in this list.
The primary key is an index, and any index is used to speed up the search for rows in the table. The search is done by comparing the searched value with the values stored in the index. Comparing integral numeric values (INT vs. INT) requires a single CPU operation; it works very fast. Comparing string values is harder: the corresponding characters of the two strings are compared taking into account the characteristics of their encoding, collation, upper/lower case etc.; usually more than one pair of characters needs to be compared. This takes a lot of CPU operations and is much slower than comparing INTs.
The InnoDB storage engine keeps a reference to the PK in every index of the table. If the PK of the table is not set or not numeric, InnoDB internally creates a numeric auto-incremented column and uses it instead (and makes the visible PK refer to it too). This means you don't waste any database space by adding an extra ID column (or you don't save anything by not adding it, if you prefer to get it the other way around).
Why does InnoDB work this way? Read the previous item again.
The PK of a table usually migrates as an FK into a related table. This means the value of the PK column of each row of the first table is duplicated into the FK field of the related table (think of the classic example of an employee that works in a department; the department_id column of department is duplicated into the employee table). Here the column type affects both the space used and the speed (the FK is usually used in JOIN, WHERE and GROUP BY clauses).
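A minimal sketch of that classic example (the column sizes are illustrative assumptions):

CREATE TABLE department (
  department_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE employee (
  employee_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name          VARCHAR(100) NOT NULL,
  department_id SMALLINT UNSIGNED NOT NULL,
  FOREIGN KEY (department_id) REFERENCES department (department_id)
);

Every employee row carries only the small numeric department_id, not the department's name.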
Here is one reason, out of many, to do it.
If the username is really the primary key of your relation, using a surrogate key (ID) is at least a space optimization. During normalization your relation may be split into several tables. Replacing the username (varchar(30)) with an ID (int) as the foreign key in the related tables can save a lot of space.

Question about how foreign key data is stored in SQL

I know this is ultra-basic, but it's an assumption I've always held and would like to validate that it's true (in general, with the details specific to various implementations)
Let's say I have a table that has a text column "Fruit". In that column only one of four values ever appears: Pear, Apple, Banana, and Strawberry. I have a million rows.
Instead of repeating that data (on average) a quarter million times each, if I extract it into another table that has a Fruit column and just those four rows, and then make the original column a foreign key, does it save space?
I assume that the four fruit names are stored only once, and that the million rows now have pointers or indexes or some kind of reference into the second table.
If my row values are longer than short fruit names I assume the savings/optimization is even larger.
The data types of the fields on both sides of a foreign key relationship have to be identical.
If the parent table's key field is (say) varchar(20), then the foreign key fields in the dependent table will also have to be varchar(20). Which means, yes, you'd have to have X million rows of 'Apple' and 'Pear' and 'Banana' repeating in each table which has a foreign key pointing back at the fruit table.
Generally it's more efficient to use numeric fields as keys (int, bigint), as those can have comparisons done with very few CPU instructions (generally a direct one cpu instruction comparison is possible). Strings, on the other hand, require loops and comparatively expensive setups. So yes, you'd be better off to store the fruit names in a table somewhere, and use their associated numeric ID fields as the foreign key.
Of course, you should benchmark both setups. These are just general rules of thumbs, and your specific requirements/setup may actually work faster with the strings-as-key version.
That is correct.
You should have
table fruits
id name
1 Pear
2 Apple
3 Banana
4 Strawberry
Where ID is a primary key.
In your second table you will use just the id of this table. That will save you physical space and will make your select statements work faster.
Besides, this structure would make it very easy for you to add new fruits.
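A rough sketch of that structure (the referencing table and its name are illustrative; a TINYINT easily covers four fruits):

CREATE TABLE fruits (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(20) NOT NULL
);

CREATE TABLE basket_items (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  fruit_id TINYINT UNSIGNED NOT NULL,
  FOREIGN KEY (fruit_id) REFERENCES fruits (id)
);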
Instead of repeating that data (on average) a quarter million times
each, if I extract it into a another table that has a Fruit column and
just those four rows, and then make the original column a foreign key,
does it save space?
No if the "Fruit" is the PRIMARY KEY of the "lookup" table, so it must also be the FOREIGN KEY in the "large" table.
However if you make a small surrogate PRIMARY KEY (such as integer "id") in the "lookup" table and than use that as the FOREIGN KEY in the "large" table, you'll save space.
First, yes, it will save space, because an INT is 4 bytes and a TINYINT is 1 byte. Secondly, searching by a field of type INT will be faster than by a VARCHAR. In addition to this, you can use ENUM if your data won't change in the future. With ENUM you will get the same, maybe faster, result as with a secondary table, and you avoid the additional join.
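A minimal sketch of that ENUM alternative, using the fruit names from the question (the table name is an illustrative assumption):

CREATE TABLE basket (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  fruit ENUM('Pear', 'Apple', 'Banana', 'Strawberry') NOT NULL
);

Each row stores the ENUM value as a small internal integer (1 byte for up to 255 members), so the strings themselves appear only once, in the table definition.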
Normalization is not just about space, it's often about redundancy and modelling the data behavior and also about updating just one row for a change - and reducing the scope of locks by updating only the minimal amount of data.
Sadly, you assume wrong: the values are physically stored repeatedly for each referencing table. Some SQL products do store the value just once but most don't, notably the more popular ones which are based on contiguous storage on disk.
This is the reason end users feel the need to implement their own pointers in the guise of integer 'surrogate keys'. A system-managed surrogate would be preferable, e.g. one that isn't visible to users, in the same way an index's values are maintained by the system and cannot be manipulated directly by users. The problem with rolling your own is that they become part of the logical model.

guid as primary key?

I have a MySQL database which has 3 tables that have to be joined together. I receive smaller databases that must feed this MySQL database, appending the new data as I get it. The problem I have is that the smaller dbs I get are generated by an outside application and are not really meant to be used all together. Therefore, when I utilize the schema of the smaller database, I have no way to know how all the records from the 3 tables belong together.
I was thinking about inserting a guid to serve as a primary key that I can add to the tables and insert when I insert all of the new data.
However, I am leery of using a char field (used to store the guid) as a key. Is this a valid concern, or is using a char field, knowing that it will always be a guid, a sufficient solution? Can someone recommend a better approach?
Thanks
MySQL does not provide a GUID/UUID type, so you would need to generate the key in the code you're using to insert rows into the DB. A char(32) or char(36) (if including hyphens) is the best data type to store a GUID in lieu of an actual GUID data type.
I'm sorry, I'm not 100% familiar with MySQL; in SQL Server Express there is a uniqueidentifier type you can set a column to, which is really a GUID. You can even set it to auto-generate values so it picks random ones.
My boss at work HATES GUIDs though, and we work with offline/online systems a lot, so he came up with another system in which each feeding database is assigned an ID (called DEPT), and whenever he inserts into the master table from one of the smaller ones, he writes its DEPT into a separate integer column so it's easily sortable.
To implement this, you'd add a second key column (making each table that the import is performed on a dual-key table).
Example:
PrimaryKey1 DEPT Name
1 0 Slink
2 0 Fink
3 0 Werd
1 1 Slammer
2 1 Blam
1 2 Werrr
2 2 Soda
MySQL 8.0 has the UUID() function, which should do the job.
Unfortunately, there was a bug which meant you could not use it in a DEFAULT expression. But apparently it is fixed in 8.0.19, as stated in this answer.
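A hedged sketch of what that would look like, assuming a version where the DEFAULT-expression bug mentioned above is fixed (the table layout is illustrative):

CREATE TABLE customer (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  uuid CHAR(36) NOT NULL DEFAULT (UUID()),
  UNIQUE KEY (uuid)
);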
Some DBs, like MS SQL Server, provide a guid data type; I am not sure about MySQL.
In general there is no problem with char or varchar as a primary key unless it is too long. Normally integers are preferred because they are a bit faster, but it depends on whether this matters much for you.
Effectively you could also use a composite primary key. One component could be your original primary key which is unique in one db only. The second component could be the number of the database, if you are able to assign a unique number to each database.
You could also use the following scheme:
int newId = dbNumber * 10000 + iDInSmallDb;
So the last 4 digits are the ID from the original db, and the remaining digits are the db number (this assumes each small db keeps its IDs below 10,000).
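If needed, the combined value can be split back into its parts in SQL; a minimal sketch, with purely illustrative table and column names:

SELECT new_id DIV 10000 AS db_number,
       new_id MOD 10000 AS id_in_small_db
FROM imported_rows;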