How to compress columns in MySQL?

I have a table which stores e-mail correspondence. Every time someone replies, the whole body of the trail is also included and saved in the database (and I need it that way, because the amount of application-level change needed to rectify that would be too high).
The mail text column is a VARCHAR of size 10000, but I am having trouble storing text longer than that. As I am not sure how many replies a trail can accumulate, I don't know what a good size for the column would be.
The engine is InnoDB. Can I use some kind of columnar compression technique in MySQL to avoid increasing the size of the column?
And what if I go ahead and increase the VARCHAR column to, say, 20000? The table has about 2 million records. Would that be a good thing to do?

You are probably looking for MySQL's COMPRESS() and UNCOMPRESS() functions, which compress data for storage and decompress it on retrieval.
Also look at InnoDB Compression Usage.
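A minimal sketch of how those functions might be used; the mail_archive table and body_compressed column are hypothetical, and the compressed value needs a binary column (VARBINARY/BLOB) rather than VARCHAR:
-- Store the compressed body (COMPRESS() returns a binary string):
INSERT INTO mail_archive (id, body_compressed)
VALUES (1, COMPRESS('... full e-mail trail ...'));
-- Read it back:
SELECT UNCOMPRESS(body_compressed) AS body FROM mail_archive WHERE id = 1;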

As long as the data doesn't need editing, you can use the ARCHIVE storage engine.

This answer is specific to Percona Server.
Percona introduced a compressed column format a while ago, which you can use in CREATE or ALTER statements:
CREATE TABLE test_compressed (
id INT NOT NULL PRIMARY KEY,
value MEDIUMTEXT COLUMN_FORMAT COMPRESSED
);
Reference: https://www.percona.com/doc/percona-server/5.7/flexibility/compressed_columns.html

For me, the best way to compress text data is to use Percona's compressed column format.
ALTER TABLE `tableName` MODIFY `mail` TEXT COLUMN_FORMAT COMPRESSED NOT NULL;
I've tested compression on a table used as a cache, storing mainly HTML data; the size decreased from 620 MB to 110.6 MB.
I think you should consider using the TEXT type instead of a long VARCHAR.
TEXT data is stored separately from the InnoDB clustered index, and that can affect, and probably improve, the performance of your database.

You have a few different options:
Wait for the RFE to add column compression to MySQL (see https://bugs.mysql.com/bug.php?id=106541) - unlikely this will ever be done
Use application level compression and decompression - much more work involved in doing this
Rely on MySQL's compress and uncompress functions to do this for you (see https://dev.mysql.com/doc/refman/8.0/en/encryption-functions.html#function_compress) - these are not reliable as they depend on how MySQL was compiled (zlib or not) - and they don't give great results a lot of the time
Don't worry about the file size as disk space is cheap and simply change the column type to TEXT (see https://dev.mysql.com/doc/refman/8.0/en/blob.html)
Often the best option if disk space is your main concern is changing the table to be compressed using: ALTER TABLE t1 ROW_FORMAT = COMPRESSED; - for emails this can give very good compression and if need be it can be tuned for even better compression for your particular workload (see https://dev.mysql.com/doc/refman/8.0/en/innodb-compression-tuning.html)
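As a hedged sketch of that last option, assuming a hypothetical mail table and a server where compressed row format is available (Barracuda file format on older 5.x versions, the default in 5.7+):
ALTER TABLE mail ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8;
-- Smaller KEY_BLOCK_SIZE values (4, 2, 1) compress harder but cost more CPU;
-- 8 is a common starting point for text-heavy data such as e-mail bodies.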

Related

How to optimize a table in MySQL?

I have one database containing 100 tables, and three of those tables have grown to between 3 GB and 8 GB. What can I do to reduce the table size?
I am using the OPTIMIZE TABLE command; it works fine, but the size does not decrease.
I am also using the Percona toolkit commands; they run and complete successfully, but the table size stays the same, so there is no effect on table size.
What can I do to solve this issue?
How can I optimize a very large table with pt-online-schema-change?
My table size is 10 GB; how can I decrease the table size?
If the data in your tables is already well organized on disk, optimize table won't help you.
You can:
delete old data
get rid of secondary indexes that you don't need
turn on compression for your tables
partition your tables and move certain partitions to other storage
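A rough sketch of the last two points, assuming a hypothetical mail_log table with a created_at date column (names are illustrative, and the partitioning key must be part of every unique key on the table):
ALTER TABLE mail_log ROW_FORMAT = COMPRESSED;   -- turn on InnoDB table compression
ALTER TABLE mail_log
  PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );
-- Old data can then be removed cheaply by dropping a whole partition:
ALTER TABLE mail_log DROP PARTITION p2020;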
How to shrink disk footprint?
Don't use INT (4 bytes) when MEDIUMINT (3 bytes) will suffice.
Don't use MEDIUMINT when SMALLINT (2 bytes) will suffice. Etc.
Don't use utf8 for hex strings.
Normalize repeated values. (Within reason)
Compress TEXT fields in the client, then use BLOB.
Do not PARTITION unless there is a valid reason for it; it is likely to make the footprint bigger, perhaps significantly bigger.
INDEX(a) is useless if you have INDEX(a,b)
OPTIMIZE TABLE is likely to be a waste of time. (As you found out.)
pt-online-schema-change (or the online DDL in MySQL 5.6) helps you make certain changes without locking the table for a long time.
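A small sketch of the first few tips, using a hypothetical table t with made-up column names; existing hex values would need converting with UNHEX() before the old text column is dropped:
ALTER TABLE t MODIFY page_views MEDIUMINT UNSIGNED NOT NULL;  -- 3 bytes instead of INT's 4
ALTER TABLE t ADD COLUMN hash_bin BINARY(16);                 -- 16 bytes instead of 32 utf8 hex characters
UPDATE t SET hash_bin = UNHEX(hash_hex);                      -- convert the existing hex strings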
You have not even mentioned the basics -- MyISAM or InnoDB. (Some techniques differ.) Windows or Unix? (Some tools do not work on Windows.)
Let's see SHOW CREATE TABLE; we can probably come up with more suggestions.
Why is 10GB scaring you? Many systems deal with 100GB. 1TB gets a bit scary.

MySQL Memory Engine vs InnoDB on RAMdisk

I'm writing a bit of software that needs to flatten data from a hierarchical type of format into tabular format. Instead of doing it all in a programming language every time and serving it up, I want to cache the results for a few seconds, and use SQL to sort and filter. When in use, we're talking 400,000 writes and 1 or 2 reads over the course of those few seconds.
Each table will contain 3 to 15 columns. Each row will contain from 100 bytes to 2,000 bytes of data, although it's possible that in some cases, some rows may get up to 15,000 bytes. I can clip data if necessary to keep things sane.
The main options I'm considering are:
MySQL's Memory engine
A good option, almost specifically written for my use case! But.. "MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. MEMORY tables cannot contain BLOB or TEXT columns." - Unfortunately, I do have text fields with a length of up to maybe 10,000 characters - and even that is a number that is not specifically limited. I could adjust the VARCHAR length based on the maximum length of the text columns as I loop through doing my flattening, but that's not totally elegant. Also, for my occasional 15,000-character row, does that mean I need to allocate 15,000 characters for every row in the database? If there were 100,000 rows, that's about 1.4 GB not including overhead!
InnoDB on RAMDisk
This is meant to run on the cloud, and I could easily spin up a server with 16gb of ram, configure MySQL to write to tmpfs and use full featured MySQL. My concern for this is space. While I'm sure engineers have written the memory engine to prevent consuming all temp storage and crashing the server, I doubt this solution would know when to stop. How much actual space will my 2,000 bytes of data consume when in database format? How can I monitor it?
Bonus Questions
Indexes
I will in fact know in advance which columns need to be filtered and sorted. I could set up an index before I do inserts, but what kind of performance gain could I honestly expect on top of a RAM disk? How much extra overhead do indexes add?
Inserts
I'm assuming inserting multiple rows with one query is faster. But that one query, or series of large queries, is itself held in memory, and we're writing to memory, so if I did that I'd momentarily need double the memory. So then we talk about doing one or two or a hundred at a time, and having to wait for that to complete before processing more. InnoDB doesn't lock the table, but I worry about sending two queries too close to each other and confusing MySQL. Is this a valid concern? With the MEMORY engine I'd definitely have to wait for completion, due to table locks.
Temporary
Are there any benefits to temporary tables other than the fact that they're deleted when the db connection closes?
I suggest you use MyISAM. Create your table with appropriate indexes for your query. Then disable keys, load the table, and enable keys.
I suggest you develop a discipline like this for your system. I've used a similar discipline very effectively.
Keep two copies of the table. Call one table_active and the second one table_loading.
When it's time to load a new copy of your data, use commands like this.
ALTER TABLE table_loading DISABLE KEYS;
/* do your insertions here, to table_loading */
/* consider using LOAD DATA INFILE if it makes sense. */
ALTER TABLE table_loading ENABLE KEYS; /* this will take a while */
/* at this point, suspend your software that's reading table_active */
RENAME TABLE table_active TO table_old;
RENAME TABLE table_loading TO table_active;
/* now you can resume running your software */
TRUNCATE TABLE table_old;
RENAME TABLE table_old TO table_loading;
Alternatively, you can DROP TABLE table_old; and create a new table for table_loading instead of the last rename.
This two-table (double-buffered) strategy should work pretty well. It will create some latency because your software that's reading the table will work on an old copy. But you'll avoid reading from an incompletely loaded table.
I suggest MyISAM because you won't run out of RAM and blow up, and you won't have the fixed-row-length overhead or the transaction overhead. But you might also consider MariaDB and the Aria storage engine, which does a good job of exploiting RAM buffers.
If you do use the MEMORY storage engine, be sure to tweak your max_heap_table_size system variable. If your read queries will use index range scans (sequential index access), be sure to specify BTREE-style indexes. See here: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html
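For example, a sketch under the assumption of a flattened cache table (all names are illustrative):
SET SESSION max_heap_table_size = 512 * 1024 * 1024;  -- allow up to ~512 MB per MEMORY table in this session
CREATE TABLE cache_flat (
  id INT NOT NULL,
  sort_key VARCHAR(100),
  payload VARCHAR(2000),
  INDEX idx_sort USING BTREE (sort_key)               -- BTREE for range scans; MEMORY defaults to HASH
) ENGINE = MEMORY;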

Varbinary vs Blob in MySQL

I have about 2 KB of raw binary data that I need to store in a table, but I don't know whether to choose the VARBINARY or BLOB type. I have read through the descriptions in the MySQL docs but didn't find any compare-and-contrast discussion. I also read that VARBINARY only supports up to 255 characters, but I successfully created a VARBINARY(2048) field, so I'm a bit confused.
The binary data does not need to be indexed, nor will I need to query on it. Is there an advantage to using one type over the other from PHP?
Thanks!
VARBINARY is limited to 255 bytes on MySQL 5.0.2 and below, and to 65 KB on 5.0.3 and above.
BLOB is limited to 65 KB.
Ultimately, VARBINARY is virtually the same as BLOB (from the perspective of what can be stored in it), unless you want to preserve compatibility with "old" versions of MySQL. The MySQL Documentation says:
In most respects, you can regard a BLOB column as a VARBINARY column that can be as large as you like.
Actually, BLOB can be bigger (there are TINYBLOB, BLOB, MEDIUMBLOB and LONGBLOB, http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html), with a size limit of up to 2^32 - 1 bytes.
Also, BLOB storage grows "outside" of the row, while the maximum VARBINARY size is limited by the amount of free row size available (so it can actually be less than 64 KB).
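A quick sketch of the two declarations for ~2 KB of raw binary data (table names are made up):
CREATE TABLE payload_varbinary (
  id INT PRIMARY KEY,
  data VARBINARY(2048)   -- declared length counts toward the 65,535-byte row-size limit
);
CREATE TABLE payload_blob (
  id INT PRIMARY KEY,
  data BLOB              -- up to 65,535 bytes; only a small pointer counts toward the row-size limit
);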
There are some minor differences between the two:
1) With index scripting, BLOB needs a prefix size on indexes, while VARBINARY doesn't: http:/en/column-indexes.html
CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
2) As already mentioned, there are trailing-space issues handled differently between VARBINARY and BLOB in MySQL 5.0.x and earlier versions: http:///en/blob.html http:///en/binary-varbinary.html
(truncating the links, since stackoverflow thinks too many links are spam)
One significant difference is that BLOB types are stored in secondary storage, while VARBINARY values are stored inline in the row, in the same way as VARCHARs and other "simple" types.
This can have an impact on performance in a busy system, where the additional lookup to fetch and manipulate the blob data can be expensive.
It is worth pointing out that the MEMORY storage engine does not support BLOB/TEXT, but it does work with VARBINARY.
I am just looking at a test app that stores around 5 KB of binary data in a column. It initially used VARBINARY, but since it was so slow I decided to try BLOB. Looking at disk write speed with atop, I can't see any difference.
The only significant difference I read in the MySQL manual is that BLOBs are unsupported by the MEMORY engine, so any temporary tables you create with queries (see when MySQL uses temp tables) will be created on disk, and that is much slower.
So your better bet is VARBINARY/BINARY, if the data is short enough to fit into a row (at the moment 64 KB total for all columns).

Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?

So, one of my tables in MySQL, which uses the InnoDB storage engine, will contain multi-billion rows (with potentially no limit to how many will be inserted).
Can you tell me what sort of optimizations I can do to help speed things up?
Because with a few million rows already, it is starting to get slow.
Of course, you could suggest I use something else; the only options I have are PostgreSQL and SQLite3, but I've been told that SQLite3 is not a good choice for this.
As for PostgreSQL, I have absolutely no idea how it performs, as I've never used it.
I expect, though, at least about 1000-1500 inserts per second into that table.
A simple answer to your question would be yes, InnoDB would be the perfect choice for a multi-billion-row data set.
There is a host of optimizations that are possible.
The most obvious optimization is setting a large buffer pool, as the buffer pool is the single most important thing when it comes to InnoDB, because InnoDB buffers the data as well as the indexes in the buffer pool. If you have a dedicated MySQL server with only InnoDB tables, then you should allow up to 80% of the available RAM to be used by InnoDB.
Another very important optimization is having proper indexes on the table (keeping in mind the data access/update pattern), both primary and secondary. (Remember that the primary key is automatically appended to every secondary index.)
With InnoDB there are some extra goodies, such as protection from data corruption, auto-recovery, etc.
As for increasing write performance, you should set up your transaction log files to total up to 4 GB.
One other thing that you can do is partition the table.
You can eke out more performance by setting binlog_format to "row" and innodb_autoinc_lock_mode to 2 (that will ensure that InnoDB does not hold table-level locks when inserting into auto-increment columns).
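A hedged sketch of the my.cnf settings described above, assuming a dedicated 16 GB server; the exact values are assumptions to adjust for your hardware:
[mysqld]
innodb_buffer_pool_size   = 12G     # roughly 80% of RAM on a dedicated InnoDB server
innodb_log_file_size      = 2000M   # two files, just under the 4 GB combined limit in 5.5
innodb_log_files_in_group = 2
binlog_format             = ROW
innodb_autoinc_lock_mode  = 2       # interleaved lock mode for auto-increment inserts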
If you need any specific advice you can contact me, I would be more than willing to help.
Optimizations
Take care not to have too many indexes. They are expensive when inserting.
Make your datatypes fit your data, as tightly as you can (so don't go saving IP addresses in a TEXT or a BLOB, if you know what I mean; see the sketch after this list). Look into VARCHAR vs CHAR. Don't forget that because VARCHAR is more flexible, you are trading away some things; if you know a lot about your data it might help to use CHARs, or it might be clearly better to use VARCHARs, etc.
Do you read at all from this table? If so, you might want to do all the reading from a replicated slave, although your connection should be good enough for that amount of data.
If you have big inserts (aside from the number of inserts), make sure your IO is actually quick enough to handle the load.
I don't think there is any reason MySQL wouldn't support this. Things that can slow you down from "thousands" to "millions" to "billions" are things like the aforementioned indexes. There is, as far as I know, no "MySQL is full" problem.
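For instance, the IP-address point might look like this (the visits table is hypothetical):
CREATE TABLE visits (
  ip INT UNSIGNED NOT NULL           -- 4 bytes instead of a 15-character string
);
INSERT INTO visits (ip) VALUES (INET_ATON('192.0.2.10'));
SELECT INET_NTOA(ip) FROM visits;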
Look into partial indexes. From Wikipedia (quickest source I could find; I didn't check the references, but I'm sure you can manage):
MySQL as of version 5.4 does not support partial indexes.[3] In MySQL, the term "partial index" is sometimes used to refer to prefix indexes, where only a truncated prefix of each value is stored in the index. This is another technique for reducing index size.[4]
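A prefix index, for reference, is just an index over the first N characters of a column; a sketch with made-up names:
ALTER TABLE messages ADD INDEX idx_subject_prefix (subject(20));  -- index only the first 20 characters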
No idea on the MySQL/InnoDB part (I'd assume it'll cope). But if you end up looking at alternatives, PostgreSQL can manage a DB of unlimited size on paper. (At least one 32TB database exists according to the FAQ.)
Can you tell me what sort of optimizations I can do to help speed up things?
Your mileage will vary depending on your application. But with billions of rows, you're at least looking into partitioning your data, in order to work on smaller tables.
In the case of PostgreSQL, you'd also look into creating partial indexes where appropriate.
You may want to have a look at:
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
http://forums.whirlpool.net.au/archive/954126
If you have a very large table (billions of records) and need to data-mine the table (queries that read lots of data), MySQL can slow to a crawl.
Large databases (200+ GB) are fine, but they are bound by IO, temp tables spilling to disk, and multiple other issues when attempting to read large groups of rows that don't fit in memory.

Postgresql vs. MySQL: how do their data sizes compare to each other?

For the same data set, with mostly text data, how does the data (table + index) size of PostgreSQL compare to that of MySQL?
PostgreSQL uses MVCC, which would suggest its data size would be bigger.
In this presentation, the largest blog site in Japan talked about their migration from PostgreSQL to MySQL. One of their reasons for moving away from PostgreSQL was that its data size was too large (p. 41):
Migrating from PostgreSQL to MySQL at Cocolog, Japan's Largest Blog Community
PostgreSQL has data compression, so that should make the data size smaller. But the MySQL InnoDB plugin also has compression.
Does anyone have any actual experience of how the data sizes of PostgreSQL and MySQL compare to each other?
MySQL uses MVCC as well, just check InnoDB. But in PostgreSQL you can change the FILLFACTOR to make space for future updates. With this, you can create a database that has space for current data but also for some future updates and deletes. When autovacuum and HOT do their things right, the size of your database can be stable.
The blog is about old versions; a lot of things have changed, and PostgreSQL does a much better job with compression than it did in the old days.
Compression depends on the datatype, configuration and speed as well. You have to test to see how it's working for your situation.
I did a couple of conversions from MySQL to PostgreSQL, and in all these cases PostgreSQL was about 10% smaller (MySQL 5.0 => PostgreSQL 8.3 and 8.4). This 10% was used to change the fillfactor on the most-updated tables; these were set to a fillfactor of 60 to 70. Speed was much better (no more problems with over 20 concurrent users) and data size was stable as well, with no MVCC bloat going out of control and no vacuum falling too far behind.
MySQL and PostgreSQL are two different beasts; PostgreSQL is all about reliability, whereas MySQL is popular.
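A sketch of the fillfactor change mentioned above, assuming a hypothetical heavily-updated posts table:
ALTER TABLE posts SET (fillfactor = 70);  -- leave ~30% free space per page for HOT updates
-- Only newly written pages honor the setting; existing pages are repacked by a rewrite
-- such as VACUUM FULL or CLUSTER, if the downtime is acceptable.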
Both have their storage requirements in their respective documentation:
MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Postgres: http://www.postgresql.org/docs/current/interactive/datatype.html
A quick comparison of the two doesn't show any flagrant "zomg Postgres requires 2 megabytes to store a bit field" type differences. I suppose Postgres could have higher metadata overhead than MySQL, or might have to extend its data files in larger chunks, but I can't find anything obvious suggesting that Postgres "wastes" space for which migrating to MySQL is the cure.
I'd like to add that for large column values, PostgreSQL also takes advantage of compressing them using a "fairly simple and very fast member of the LZ family of compression techniques".
To read more about this, check out http://www.postgresql.org/docs/9.0/static/storage-toast.html
It's rather low-level and probably not necessary to know, but since you're using a blog, you may benefit from it.
About indexes:
MySQL stores the data within the index, which makes its indexes huge. Postgres doesn't. This means that the storage size of a b-tree index in Postgres doesn't depend on the number of columns it spans or which data types those columns have.
Postgres also supports partial indexes (e.g. WHERE status=0), which is a very powerful feature to prevent building indexes over millions of rows when only a few hundred are needed.
Since you're going to put a lot of data in Postgres you will probably find it practical to be able to create indexes without locking the table.
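Two sketches of those PostgreSQL features, with illustrative table and column names:
CREATE INDEX idx_posts_pending ON posts (created_at) WHERE status = 0;   -- partial index over a small subset
CREATE INDEX CONCURRENTLY idx_posts_author ON posts (author_id);         -- built without blocking writes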
Sent from my iPhone. Sorry for bad spelling and lack of references