Smallest storage of an integer array in MySQL?

I have a table of user entries, and for every entry I have an array of (2-byte) integers to store (15-25 elements, sporadically even more). The array elements are written and read all at the same time; they never need to be updated or accessed individually. Their order matters. It makes sense to think of this as an array object.
I have many millions of these user entries and want to store them with the minimum possible amount of disk space. However, I'm struggling with MySQL's lack of an array datatype.
I've been considering the following options.
Do it the MySQL way. Make a table my_data with columns user_id, data_id and data_int. To make this efficient, one needs an index on user_id, totalling well over 10 bytes per integer.
Store the array in text format. This takes ~6.5 bytes per integer.
Making 35-40 columns ("enough") and having -32768 mean 'empty' (since this value cannot occur in my data). This takes 3.5-4 bytes per integer, but is somewhat ugly, as it imposes a strict limit on the number of elements in the array.
Is there a better way to do this in MySQL? I know MySQL has an efficient varchar type, so ideally I'd store my 2-byte integers as 2-byte chars in a varchar (or a similar approach with blob), but I'm not sure how to do that. Is this possible? How should this be done?

You could store them as separate SMALLINT NULL columns.
In MyISAM, this uses 2 bytes of data + 1 bit of null indicator for each value.
In InnoDB, the null indicators are encoded into the column's field start offset, so they don't take any extra space, and null values are not actually stored in the row data. If the rows are small enough that all the offsets are 1 byte, then this uses 3 bytes for every existing value (1 byte offset, 2 bytes data), and 1 byte for every nonexistent value.
Either of these would be better than using INT with a special value to indicate that it doesn't exist, since that would be 4 bytes of data for every value.
See NULL in MySQL (Performance & Storage)
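A minimal sketch of that layout (table and column names are placeholders; continue the column pattern up to the maximum array length):
CREATE TABLE user_entry (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    v01 SMALLINT NULL,  -- first array element, NULL when absent
    v02 SMALLINT NULL,
    v03 SMALLINT NULL
);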

The best answer was given in the comments, so I'll repost it here with some use-ready code, for further reference.
MySQL has a varbinary type that works really well for this: you can use PHP's pack/unpack functions to convert the values to and from binary form, and store that binary form in the database in a varbinary column. (For plain 2-byte values, pack's 'n*' or 's*' format codes can handle a whole array in one call.) Example code for a 3-byte conversion is below.
function pack24bit($n) { // input: 24-bit integer, output: binary string of 3 bytes
    $b3 = $n % 256;                // least significant byte
    $b2 = intdiv($n, 256) % 256;   // middle byte
    $b1 = intdiv($n, 65536);       // most significant byte
    return pack('CCC', $b1, $b2, $b3);
}
function unpack24bit($packed) { // input: binary string of 3 bytes, output: 24-bit integer
    $arr = unpack('C3b', $packed); // yields keys b1, b2, b3
    return 256 * (256 * $arr['b1'] + $arr['b2']) + $arr['b3'];
}
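For reference, a sketch of the schema side, with placeholder names and the column sized for up to 40 packed 3-byte values:
CREATE TABLE user_entry (
    user_id     INT UNSIGNED NOT NULL PRIMARY KEY,
    packed_data VARBINARY(120) NOT NULL  -- concatenated output of pack24bit()
);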

Related

MySQL - What's the best way to store a static array size of TINYINT in a column

I need to store an array of unsigned TINYINT in a MySQL table for each user.
The array length is constant and I don't need to do any search or sorting in it.
Only its values are changed over time.
My goal is to store the values so that the data size stays as close as possible to N x TINYINT bytes per row, and ideally remains readable.
I was considering 2 solutions:
Solution 1:
| user_id | TINYINT_1 | TINYINT_... | TINYINT_N |
Solution 2:
| user_id | JSON array [TINYINT_1, TINYINT_..., TINYINT_N] |
The second seems cleaner, as I don't need to invent N useless column names, but from what I understand I have no control over the type used to store the values in a JSON array, and I'm afraid it will increase the final storage size to well over N x TINYINT per row.
Is there a way to control the type of the values, or some other smarter way to do this?
Thanks for your advice.
One TINYINT takes one byte. The only way to ensure that N TINYINTs are stored in N bytes is to store them as a BINARY(N) for N up to 255, or a BLOB if longer. That is, each TINYINT gets one byte in a binary string. That's not readable at all, but it is the most compact way to store it.
Because you would be responsible for interpreting this string byte-by-byte, there is no way it could be misinterpreted, or the elements treated as some other data type. That's entirely up to your code that unpacks the string.
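As an illustration, a sketch of the BINARY(N) approach for N = 20, with placeholder names; HEX()/UNHEX() are used only to write and inspect the packed string from SQL:
CREATE TABLE user_scores (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    vals    BINARY(20) NOT NULL  -- one byte per TINYINT, fixed length
);
INSERT INTO user_scores VALUES (1, UNHEX('000102030405060708090A0B0C0D0E0F10111213'));
SELECT user_id, HEX(vals) FROM user_scores WHERE user_id = 1;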
MySQL does not have an array type (like for example PostgreSQL does).
If you want the array of TINYINT to be stored in a readable fashion, you could store it as a string of hex digits, with two digits per TINYINT. This takes exactly twice the space of the BINARY(N) solution.
You can also store the numbers as a string of comma-separated text numbers in base 10, which is more readable, but takes more space.
You can also use JSON, which allows for an array of numbers, but it takes even more space because it stores the numbers as base-10 text, and there need to be [ ] array delimiters. And you already raised the issue that JSON allows arbitrary types for the array elements. MySQL supports JSON Schema validation, but not automatically; you'd have to write a CHECK constraint. There's an example here: https://dev.mysql.com/doc/refman/8.0/en/json-validation-functions.html
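A hedged sketch of such a constraint, assuming MySQL 8.0.17 or later (which added JSON_SCHEMA_VALID) and placeholder table/column names:
CREATE TABLE user_scores_json (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,
    vals    JSON NOT NULL,
    -- reject anything that is not an array of integers in the TINYINT range
    CHECK (JSON_SCHEMA_VALID(
        '{"type":"array","items":{"type":"integer","minimum":0,"maximum":255}}',
        vals
    ))
);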

MySQL: parse and cast strings which contain numbers with units

I have a table with a column holding string values made up of a number and a unit. Each value begins with a numeric prefix composed of digits and at most one decimal point.
Some examples of these values would be the following:
"16 GB", "8.5gb", "15.99345 GHz", "25L"
Is there a way I can use the cast function to first parse the string values that contain numbers and decimals and only do the cast on that portion of the values?
This is what I had in mind
select * from my_table
where cast( numparse( my_column ) as signed ) > 10
Thanks in advance, I'm fairly new to SQL so any help would be appreciated.
Yes, you could write a stored procedure that does some sort of string parsing, or use a regex as in @ladd2025's answer...
But then you'd be redoing this conversion on every query. There's the cost of the conversion itself, but it also means you cannot take advantage of indexing. A query like where parse_the_thing( thing ) > 10 has to do a full table scan, whereas if thing were an indexed number to begin with, where thing > 10 would be a very fast indexed query. This is the problem with storing "formatted" information: you have to strip the formatting every time you want to do something with it.
You'd be far better off normalizing your stored data: store the magnitude as a numeric data type such as bigint, double, or numeric, and the unit as an enum column. Also consider whether it makes sense to store all these different units in the same table; does it make sense to compare 8.5 GB with 15.99 GHz?
8.5 GB stored in bytes would become the bigint 8,500,000,000 (the exact value depends on whether a gigabyte is counted as 1000^3 or 1024^3 bytes) with the unit bytes. 15.99345 GHz might become the bigint 15,993,450,000 with the unit Hz. And so on.
You can accomplish this by adding the new columns to your table and running an update that converts the strings into a quantity and a unit, then changing whatever inputs the values to do the same. You can continue to store the original human-formatted string if you like, but you might be better off dropping it and applying the formatting as needed.
This makes your queries much simpler, with less chance of bugs. And they can take advantage of indexing, so they'll be much, much faster.
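A rough sketch of that migration, using placeholder names and a simplified variant that keeps the original unit in an enum instead of converting to a base unit; it relies on CAST() keeping the leading numeric prefix and on stripping digits, dots and spaces to recover the unit text (REGEXP_REPLACE needs MySQL 8.0+):
ALTER TABLE my_table
    ADD COLUMN magnitude DECIMAL(20,6) NULL,
    ADD COLUMN unit ENUM('GB','GHZ','L') NULL;

UPDATE my_table
SET magnitude = CAST(my_column AS DECIMAL(20,6)),                  -- '8.5gb' -> 8.500000 (with a truncation warning)
    unit      = UPPER(REGEXP_REPLACE(my_column, '[0-9. ]', ''));   -- '8.5gb' -> 'GB'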
You could use REGEXP_REPLACE:
SELECT *
FROM tab
WHERE CAST(REGEXP_REPLACE(my_column, '[^0-9/.]', '') AS signed) > 10;
DBFiddle Demo
Just use the CAST() function. If you're casting to a numeric type, it will just parse the prefix and ignore the rest.
mysql> select cast('12.45gb' as signed);
+---------------------------+
| cast('12.45gb' as signed) |
+---------------------------+
| 12 |
+---------------------------+

mySQL - Does Int(9.455.487) take more space than string(John) in mySQL?

I understood that in a database an int takes less space than a string. But what if the int is really longer than the string? For example 9.455.487 vs "John". Which one will take more space? TY
From the documentation, size of int is 4 bytes, whereas for char it is "M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set." and M is the declared column size.
So when you talk of how much space is taken, the int will take up 4 bytes for a value as long as the value is within the range of int. A string like "John", if declared as char(4) will take up 4 * w bytes, so at least 4 bytes assuming w is 1.
Long story short, the size of a number is not how many characters long it is when you write it out, but the number of bytes to represent it in the binary form.
You should be aware of what an "int" (integer) is and what strings are. An integer always has a fixed length, which is the number of bytes in its binary representation. Strings, on the other hand, are sequences of bytes, so depending on the encoding each symbol may take one or more bytes.
The fact that 9.455.487 looks "longer" than "John" is irrelevant here. What is relevant is how the DBMS (or any other environment) represents those things internally. You're seeing a "longer" integer versus a "shorter" string, but that is only a matter of screen representation (i.e. what you see on the screen).
To answer the question: in MySQL, INT is 4 bytes, while string data types such as VARCHAR may have dynamic length. The fixed-length string data type is CHAR, and from that viewpoint your number and your string would have the same length (4 bytes). Strings and integers are simply different things to compare for "length", and their visual representation should not confuse you: these entities have different internal structures and therefore should not be compared by the "length" of what you see on screen.
Also, be aware that an integer will not always be 4 bytes long; even in MySQL your number may be stored as, for example, a BIGINT (which is 8 bytes). And, as mentioned above, for strings there is also the encoding issue: a UTF-8 encoded string may use two (or more) bytes for some non-ASCII symbols, in which case each such symbol adds more than 1 byte to the total string length.

Recommended way to store a string in this case?

I am storing strings and 99.5+% are less than 255 characters, so I store them in a VARCHAR(255).
The thing is, some of them can be 4kb or so. What's the best way to store those?
Option #1: store them in another table with a pointer back to the main table.
Option #1.0: add an INT column with DEFAULT NULL and store the pointer there
Option #1.1: store the pointer in the VARCHAR(255) column itself, e.g. 'AAAAAAAAAAA[NUMBER]AAAAAAAAAAAA'
Option #2: increase the size of VARCHAR from 255 to 32767
What's the best of the above, Option #1.0, Option #1.1 or Option #2, performance wise?
Increase the size of your field to fit the max size of your string. A VARCHAR will not use the space unless needed.
VARCHAR values are stored as a 1-byte or 2-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes.
http://dev.mysql.com/doc/refman/5.0/en/char.html
The MySQL documentation says that VARCHAR(N) takes L + 1 bytes if column values require 0-255 bytes, or L + 2 bytes if values may require more than 255 bytes, where L is the length in bytes of the stored string.
So I guess that option #2 is quite okay, because the small strings will still take less space than 32767 bytes.
EDIT:
Also imagine the countless problems options 1.0 and 1.1 would raise when you actually want to query a string without knowing whether it exceeds the length or not.
Option #2 is clearly best. It just adds 1 byte to the size of each value, and doesn't require any complicated joins to merge in the fields from the second table.
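For completeness, a sketch of option #2 with placeholder table/column names; whether VARCHAR(32767) is actually accepted depends on the character set and the 65,535-byte row-size limit, so the number may need to be lowered:
-- widen the existing column; short strings still only pay length prefix + data
ALTER TABLE entries MODIFY COLUMN body VARCHAR(32767) NULL;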

Correct way to store a bit array

I'm working on a project that needs to store something like
101110101010100011010101001
into the database. It's not a file or archive: it's only a bit array, and I think that storing it in a varchar column is a waste of space and performance.
I've looked at the BLOB and VARBINARY types, but both of them allow inserting a value like 54563423523515453453, which is not exactly a bit array.
For sure, if I store a bit array like 10001000 in a BLOB/varbinary/varchar column, it will consume more than a byte, and I want the minimum space to be consumed: eight bits should take only one byte, 16 bits two bytes, and so on.
If that's not possible, what is the best approach to waste as little space as possible in this case?
Important notes: The size of the array is variable, and is not divisible by eight in every situation. Sometimes I will need to store 325 bits, other times 7143 bits....
In one of my previous projects, I converted streams of 1's and 0's to decimal, but they were shorter. I don't know if that would be applicable to your project.
On the other hand, IMHO, you should clarify what you will need to do with that data once it's stored. Search? Compare? It might largely depend on the purpose of the database.
Could you gzip it and then store it? Is that applicable?
Binary is a string representation of a number. The string
101110101010100011010101001
represents the number
... + 1*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 0*2^1 + 1*2^0
As such, it could be stored in a 32-bit integer if it were converted from a binary string to the number it represents. In Perl, one would use
oct('0b'.$binary)
But you have a variable number of bits. Not a problem! Just process them 8 at a time to create a string of bytes to place in a BLOB or similar.
Ah, but there's a catch. You'll need to add padding to get a number of bits divisible by 8, which means you'll also need a way to remove that padding later. A simple approach, if there's a known maximum length, is to use a length prefix. E.g. if you know the number of bits is never going to exceed 65,535, encode the number of bits in the first two bytes of the string.
pack('nB*', length($binary), $binary)
which is reversed using
my ($length, $binary) = unpack('nB*', $packed);
substr($binary, $length) = '';
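Finally, a sketch of a column that could hold such a packed string, assuming at most 65,535 bits and placeholder names:
CREATE TABLE bit_sets (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    bits VARBINARY(8194) NOT NULL  -- 2-byte bit-count prefix + up to 8192 data bytes
);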