Does storing ASCII files as a BLOB field in MySQL or Oracle offer a reduction in storage requirements? - mysql

I have several thousand small ASCII files containing 3D cartesian coordinates for atoms in molecules (among other information) that I need to store somewhere.
A simple calculation told me that we will require several terabytes of space, which might be reduced to several gigabytes at most, but that is not manageable under our current infrastructural constraints. Somebody told me that some people have stored similar numbers of files (of the same format, but sometimes bzipped) in MySQL and Oracle as a BLOB field. My question is: does storing such files as BLOBs offer some form of reduction in storage requirements? If yes, how much of a reduction can I expect?
This is example text from an ASCII file that needs to be stored:
#<TRIPOS>MOLECULE
****
5 4 1 1 0
SMALL
GAST_HUCK
#<TRIPOS>ATOM
1 C1 -9.7504 2.6683 0.0002 C.3 1 <1> -0.0776
2 H1 -8.6504 2.6685 0.0010 H 1 <1> 0.0194
3 H2 -10.1163 2.1494 -0.8981 H 1 <1> 0.0194
4 H3 -10.1173 3.7053 -0.0004 H 1 <1> 0.0194
5 H4 -10.1176 2.1500 0.8982 H 1 <1> 0.0194
#<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
4 1 5 1
#<TRIPOS>SUBSTRUCTURE
1 **** 1 TEMP 0 **** **** 0 ROOT
#<TRIPOS>NORMAL
#<TRIPOS>FF_PBC
FORCE_FIELD_SETUP_FEATURE Force Field Setup information
v1.0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NONE 0 0 0 0 1 0 0 0 0 0 0 0 0

Storing data in a BLOB column offers no form of reduction in storage requirements. The storage requirements for BLOB types are simple:
TINYBLOB: L + 1 bytes, where L < 2^8
BLOB: L + 2 bytes, where L < 2^16
MEDIUMBLOB: L + 3 bytes, where L < 2^24
LONGBLOB: L + 4 bytes, where L < 2^32
L represents the length of the string data in bytes.
See Data Type Storage Requirements for further details.
If there is no need to search the contents of the molecule files in your database, you can reduce the storage requirements by compressing the data prior to inserting it or using the MySQL COMPRESS() function on insert.
I think that addresses your main question. Based on those figures, the number of files you plan to store, and their average size, you can calculate how much storage space the BLOB columns will consume.
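To get a feel for the savings from compressing before insert, here is a quick sketch in Python using zlib (the same compression family MySQL's COMPRESS() uses). The repeated excerpt below is only a stand-in for a real molecule file, so treat the ratio as illustrative, not a promise:

```python
import zlib

# A short excerpt of the TRIPOS/mol2-style file from the question,
# repeated to simulate a realistically sized file.
molecule = b"""#<TRIPOS>MOLECULE
****
5 4 1 1 0
SMALL
GAST_HUCK
#<TRIPOS>ATOM
1 C1 -9.7504 2.6683 0.0002 C.3 1 <1> -0.0776
2 H1 -8.6504 2.6685 0.0010 H 1 <1> 0.0194
""" * 50

compressed = zlib.compress(molecule, level=9)
ratio = len(compressed) / len(molecule)
print(f"original: {len(molecule)} B, compressed: {len(compressed)} B, ratio: {ratio:.2f}")
```

Highly repetitive text like these fixed-format coordinate files tends to compress very well, which is why compressing client-side (or via COMPRESS() on insert) is the real lever here, not the BLOB type itself.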

Related

Encode data into 1 byte

I have to encode data into 1 byte. I have the following data as of now:
Size - 500 ml and 1 litre
Frequency - 0 to 12
% - 0-100
So I decided to break the data down as follows:
0 0 0 0 0 0 0 0
1st bit - Size - 0 for 500ml and 1 for 1 litre
2-5 bits - Frequency - 0 to 12 (0000 for 0 and 1100 for 12)
I am not sure how to fit the % into this scheme. Am I approaching this the wrong way? Is there another way to do it? Any direction is highly appreciated.
You are left with 3 bits, but you need to store a value between 0 and 100, which needs at least 7 bits (2^7 = 128). However, if you only need 8 distinct percentage values, you can get away with using 3 bits.
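A minimal sketch of that layout in Python, assuming bit 7 holds the size flag, bits 3-6 the frequency, and the low 3 bits a percentage quantized to 8 levels (the exact bit positions are my assumption, not something fixed by the question):

```python
def pack(size_is_litre: bool, frequency: int, percent: int) -> int:
    """Pack into one byte: bit 7 = size, bits 3-6 = frequency (0-12),
    bits 0-2 = percent quantized to 8 levels (lossy!)."""
    assert 0 <= frequency <= 12
    assert 0 <= percent <= 100
    percent_level = round(percent * 7 / 100)  # 0..7, loses precision
    return (int(size_is_litre) << 7) | (frequency << 3) | percent_level

def unpack(b: int):
    size_is_litre = bool(b >> 7)
    frequency = (b >> 3) & 0x0F
    percent = round((b & 0x07) * 100 / 7)     # approximate original percent
    return size_is_litre, frequency, percent
```

Note that the percentage only round-trips exactly for the 8 representable levels; any other value comes back as the nearest level.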

Database size calculation?

What is the most accurate way to estimate how big a database would be with the following characteristics:
MySQL
1 Table with three columns:
id --> bigint
field1 --> varchar(32)
field2 --> char(32)
there is an index on field2
You can assume varchar 32 is fully populated (all 32 characters). How big would it be if each field is populated and there are:
1 Million rows
5 Million rows
1 Billion rows
5 Billion rows
My rough estimate works out to: 1 byte for id, 32 bits each for the other two fields. Making it roughly:
1 + 32 + 32 = 65 * 1 000 000 = 65 million bytes for 1 million rows
= 62 Megabyte
Therefore:
62 Mb
310 Mb
310 000 Mb = +- 302Gb
1 550 000 Mb = 1513 Gb
Is this an accurate estimation?
If you want to know the current size of a database you can try this:
SELECT table_schema "Database Name"
, SUM(data_length + index_length) / (1024 * 1024) "Database Size in MB"
FROM information_schema.TABLES
GROUP BY table_schema
My rough estimate works out to: 1 byte for id, 32 bits each for the other two fields.
You're way off. Please refer to the MySQL Data Type Storage Requirements documentation. In particular:
A BIGINT is 8 bytes, not 1.
The storage required for a CHAR or VARCHAR column depends on the character set used by your database (!), but will be at least 32 bytes (not bits!) for CHAR(32) and 33 bytes for VARCHAR(32).
You have not accounted at all for the size of the index. The size of this will depend on the database engine, but it's definitely not zero. See the documentation on the InnoDB row structure for more information.
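Putting those corrected figures together, here is a back-of-the-envelope sketch (data bytes only; it deliberately ignores InnoDB row headers, page fill factor, and the index on field2, all of which add real overhead in practice):

```python
# Per-row data size, assuming a 1-byte-per-character charset (e.g. latin1).
BIGINT = 8           # id
CHAR_32 = 32         # field2 (fixed length)
VARCHAR_32 = 32 + 1  # field1: 32 data bytes + 1 length-prefix byte

row_bytes = BIGINT + CHAR_32 + VARCHAR_32  # 73 bytes

for rows in (1_000_000, 5_000_000, 1_000_000_000, 5_000_000_000):
    print(f"{rows:>13,} rows: ~{rows * row_bytes / 1024**2:,.0f} MB (data only)")
```

Even this lower bound (~70 MB per million rows) is above the asker's 62 MB estimate, and real on-disk usage will be higher still once the index is included.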
On the MySQL website you'll find quite comprehensive information about storage requirements:
http://dev.mysql.com/doc/refman/5.6/en/storage-requirements.html
It also depends if you use utf8 or not.

MySQL Data Type to Store Negative Number

I have a rating system whose values range from -1 through 0 up to 5.
So I need to store the following values:
-1
0
1
2
3
4
5
Which is the best data type, considering I want to fetch a total count of each value (not a sum)?
I'd go with TINYINT, documented here. It takes 1 byte of storage and its range is -128 to 127.
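For the "count of each value" part, that is a plain GROUP BY in SQL; here is a Python sketch of the equivalent aggregation (the sample ratings and the reviews table named in the comment are hypothetical):

```python
from collections import Counter

# Hypothetical sample of stored ratings (TINYINT values in -1..5).
ratings = [-1, 0, 5, 3, 5, -1, 2, 5, 0, 1]

# Equivalent of: SELECT rating, COUNT(*) FROM reviews GROUP BY rating;
counts = Counter(ratings)
for value in range(-1, 6):
    print(f"rating {value:>2}: {counts.get(value, 0)}")
```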

What is the maximum declared column length of MySql TEXT / BLOB types?

Prelude
My question disregards "the largest value you actually can transmit between the client and server is determined by the amount of available memory and the size of the communications buffers".
I also don't take Unicode into account here. I'm aware that if a character uses more than 1 byte of storage, the actual maximum length (number of characters) of TEXT columns will decrease.
When consulting the MySQL docs here: http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html, I can derive two answers to my question...
1) The more obvious answer:
TINYBLOB : 2 ^ 8 - 1 = 255
BLOB : 2 ^ 16 - 1 = 65535
MEDIUMBLOB : 2 ^ 24 - 1 = 16777215
LONGBLOB : 2 ^ 32 - 1 = 4294967295
2) The bit more complicated answer:
TINYBLOB : 2 ^ 8 - 1 = 255
BLOB : 2 ^ 16 - 2 = 65534
MEDIUMBLOB : 2 ^ 24 - 3 = 16777213
LONGBLOB : 2 ^ 32 - 4 = 4294967292
MySQL stores the size of the actual data along with the data itself, and in order to store that size it needs:
1 byte when data < 256 B
2 bytes when data < 64 KB
3 bytes when data < 16 MB
4 bytes when data < 4 GB
So to store the data plus the size of the data, and prevent it from exceeding 256 / 64K / 16M / 4G bytes of needed storage, you will need the -1 / -2 / -3 / -4 factor when determining the maximum declared column length (not -1 / -1 / -1 / -1). I hope this makes sense :)
The question
Which of these 2 answers is correct?
(Assuming one of them is.)
It's answer 1.
From the doc you link to:
These correspond to the four BLOB types and have the same maximum lengths and storage requirements. See Section 11.6, “Data Type Storage Requirements”.
That other page has a table with those constraints. For LONGBLOB, the storage required is:
L + 4 bytes, where L < 2^32
L represents the actual length in bytes of a given string value.
As for the maximum declared column length, just try it out yourself:
mysql> create table foo(a blob(4294967295));
Query OK, 0 rows affected (0.08 sec)
mysql> create table bar(a blob(4294967296));
ERROR 1439 (42000): Display width out of range for column 'a' (max = 4294967295)
(You can't declare a size for the other three blob types.)
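The arithmetic behind answer 1 is easy to double-check: the length prefix is stored alongside the value rather than carved out of it, so each type's maximum is exactly 2^(8 * prefix bytes) - 1:

```python
# Length-prefix size in bytes for each BLOB type, per the MySQL docs.
limits = {"TINYBLOB": 1, "BLOB": 2, "MEDIUMBLOB": 3, "LONGBLOB": 4}

for name, prefix_bytes in limits.items():
    max_len = 2 ** (8 * prefix_bytes) - 1
    print(f"{name:>10}: max {max_len:,} bytes, stored as L + {prefix_bytes} bytes")
```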

Storing distance matrix as VarChar(Max) in SQL Server database

I'm developing a fleet scheduling application and am looking for an effective way to store distances between geographic locations.
The application-code accesses the matrix as a two-dimensional array double[,].
In order to make the matrix persistent, I currently serialize the matrix as a string. After serialization it looks like this:
"1 4 9 8 3 6 \n
5 6 7 9 3 6 \n
34 4 5 6 6 7 \n"
Then it is stored in a column of type varchar(max) in a SQL Server 2008 database. However, I wonder if this string could get too big.
Assuming that each entry has one digit and neglecting white-spaces and "\n"s, theoretically I could store the distances of around 46000 locations (square-root of 2 147 483 647 - the size of varchar(max)) in one entry. This would be sufficient in my context.
Does this approach have any severe disadvantages? Would it be better to store distances in an extra table, where each row contains one distance between two locations?
If 100 users of our application each stored 1000 locations, I would have 100 * 1000 * 1000 = 100,000,000 rows in such a table...
You could just compress the array into a BLOB field; that will be far more compact than a decimal string. Instead of serializing to a string, compress to a byte array, and do the reverse on read.
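As a sketch of that approach (in Python rather than the .NET code the question implies; the on-disk layout here, dimensions as uint32 followed by row-major float64 values, is just one reasonable choice), the resulting blob could then be stored in a varbinary(max) column:

```python
import struct
import zlib

def matrix_to_blob(matrix):
    """Serialize a rectangular list-of-lists of floats to a compressed blob.
    Layout (an assumption for this sketch): rows, cols as little-endian
    uint32, then row-major float64 values, all zlib-compressed."""
    rows, cols = len(matrix), len(matrix[0])
    flat = [v for row in matrix for v in row]
    raw = struct.pack(f"<II{rows * cols}d", rows, cols, *flat)
    return zlib.compress(raw)

def blob_to_matrix(blob):
    raw = zlib.decompress(blob)
    rows, cols = struct.unpack_from("<II", raw)
    flat = struct.unpack_from(f"<{rows * cols}d", raw, 8)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
```

At 8 bytes per distance this is also lossless, unlike truncating distances to single-digit strings, and distance matrices with many similar values usually compress well on top of that.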