RocksDB vs Cassandra - mysql

Both MyRocks (MySQL) and Cassandra use an LSM architecture to store their data. I populated around 5 million rows in MySQL with MyRocks as the storage engine, and the same data in Cassandra. In Cassandra it takes only 1.7 GB of disk space, while in MySQL with MyRocks it takes 19 GB.
Am I missing something? Both use the same LSM mechanism. But why do they differ in data size?
Update:
I guess it has something to do with the text column. My table structure is (bigint, bigint, varchar, text).
Rows populated: 300 000
In MyRocks the data size is 185 MB.
In Cassandra - 13 MB.
But if I remove the text column then:
MyRocks - 21.6 MB
Cassandra - 11 MB
Any idea about this behaviour?

Well, the reason for the above behaviour turned out to be rocksdb_block_size being set to 4 KB. With such small data blocks the compressor has less data to work with per block, so compression is less effective. Setting it to 16 KB solved the issue; now I get a data size similar to Cassandra's.
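For reference, a minimal sketch of that change (assuming the block size is set in the server option file; the file location and section name depend on your installation):
[mysqld]
rocksdb_block_size=16384
After a restart you can verify it with SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_size'; note that existing SST files only pick up the new block size once they are rewritten by compaction or a table rebuild.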

Not 100% sure about MyRocks, but Cassandra is an LSM-based key-value store, which means that if a column is null it simply isn't stored on disk. A traditional RDBMS will still consume some space for it (varchar overhead, null markers, pointers, etc.), so this may account for some of your lost space.
Additionally, Cassandra compresses data by default. To compare like for like, try disabling it:
ALTER TABLE myTable WITH compression = { 'enabled' : 'false' };
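If I remember correctly, changing the compression options only affects newly written SSTables; to rewrite the existing ones you can run something like (the keyspace and table names are placeholders):
nodetool upgradesstables -a myKeyspace myTable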

Related

mysql json vs mongo - storage space

I am experiencing an interesting situation, and although it is not an actual problem, I can't understand why it is happening.
We had a Mongo database consisting mainly of some bulk data stored in an array. Since over 90% of the team was familiar with MySQL while only a few of us were familiar with Mongo, and since it is not a critical DB and all queries are done over two of the fields (client or product), we decided to move the data into MySQL, into a table like this:
[idProduct (bigint unsigned), idClient (bigint unsigned), data (json)]
where data is a huge JSON document containing hundreds of attributes and their values.
We also partitioned it into 100 partitions by a hash over idClient:
PARTITION BY HASH(idClient)
PARTITIONS 100;
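For context, a minimal sketch of what that table definition might look like (the table name and primary key are assumptions based on the description above; MySQL requires the partitioning column to appear in every unique key):
CREATE TABLE product_client_data (
  idProduct BIGINT UNSIGNED NOT NULL,
  idClient  BIGINT UNSIGNED NOT NULL,
  data      JSON,
  PRIMARY KEY (idProduct, idClient)
)
PARTITION BY HASH(idClient)
PARTITIONS 100;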
All is working fine but I noticed an interesting fact:
The original Mongo DB had about 70 GB, give or take. The MySQL version (actually containing less data, because we removed some duplicates that we were using as indexes in Mongo) is over 400 GB.
Why does it take so much more space? In theory BSON should actually be slightly larger than JSON (at least in most cases). Even if indexes are larger in MySQL... the difference is huge (over 5x).
I did a presentation, How to Use JSON in MySQL Wrong (video), in which I imported the Stack Overflow data dump into JSON columns in MySQL. I found that the data I tested with took 2x to 3x more space than importing the same data into normal tables and columns using conventional data types for each column.
JSON uses more space for the same data, for example because it stores integers and dates as strings, and also because it stores key names on every row, instead of just once in the table header.
That's comparing JSON in MySQL vs. normal columns in MySQL. I'm not sure how MongoDB stores data and why it's so much smaller. I have read that MongoDB's WiredTiger engine supports options for compression, and snappy compression is enabled by default since MongoDB 3.0. Maybe you should enable compressed format in MySQL and see if that gives you better storage efficiency.
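A minimal sketch of what enabling InnoDB's compressed row format might look like (the table name is a placeholder; this assumes innodb_file_per_table is enabled, which it is by default on recent versions):
ALTER TABLE product_client_data ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
KEY_BLOCK_SIZE=8 means 8 KB compressed pages; 4 and 16 are also valid, with smaller values trading CPU for better compression.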
JSON in MySQL is stored like TEXT/BLOB data, in that it gets mapped into a set of 16 KB pages. Pages are allocated one at a time for the first 32 pages (that is, up to 512 KB). If the content is longer than that, further allocation is done in increments of 64 pages (1 MB). So it's possible that if a single TEXT/BLOB/JSON value is, say, 513 KB, it would allocate 1.5 MB.
I think the main reason could be that internally Mongo stores JSON as BSON ( http://bsonspec.org/ ), and the spec stresses that this representation is lightweight.
The WiredTiger Storage Engine in MongoDB uses compression by default. I don't know the default behavior of MySQL.
Unlike MySQL, MongoDB is designed to store JSON/BSON; in fact it does not store anything else. So this kind of "competition" might be a bit unfair for MySQL, which stores JSON like TEXT/BLOB data.
If you had relational data, i.e. column-based values, then most likely MySQL would be smaller, as stated by @Bill Karwin. However, with smart bucketing in MongoDB you can reduce the data size significantly.

Is there a 1 GB storage limit on a single table in MySQL?

We have a web-based app in production with thousands of users. We analyzed embedded DBs, and while reading about the data storage capacity of MySQL, we came across this:
Each individual table should not exceed 1 GB in size or 20 million rows
My requirement is to store BLOBs in one table of my MySQL DB.
If the storage capacity of a MySQL table is only 1 GB, will my DB crash in production once my BLOBs quickly fill up that 1 GB?
I don't know where you read that an individual table should not exceed 1 GB or 20 million rows. Those are definitely not size limits for MySQL tables.
See my answer in Maximum number of records in a MySQL database table
You will fill up your storage before you reach the maximum table size MySQL supports. In other words, there is no storage that exists today that is larger than the limit of a MySQL InnoDB table.
That said, there are recommended size limits. For example, if you let a table grow too large (e.g. 1TB), it will take days to do a schema change or an optimize table.
At my work, we recommend to the developers that they keep tables under 500 GB, and we warn them if a table goes over 1 TB, because the larger the table gets, the harder it is to complete certain operational tasks (defragmentation, backups, restores, schema changes, etc.).
There's no specific limit or threshold, it's simply that "the bigger it is, the more time it takes."
This is true for anything that runs on a computer — not just MySQL and not just databases.
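If you want to keep an eye on this in practice, a minimal sketch of a size check (the schema name is a placeholder; table_rows is only approximate for InnoDB):
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb,
       table_rows
FROM information_schema.TABLES
WHERE table_schema = 'your_database'
ORDER BY (data_length + index_length) DESC;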

MySQL: calculating the RAM footprint of a single table's B+Tree (comparison with a Python data structure)

I have the below data that I am caching in Python now:
id timestamp data-string
The data-string size is ~87 bytes. Storing this optimally in Python (using a dict and having the timestamp prepended to the data-string with a delimiter), the RAM cost per entry comes to ~198 bytes. This is quite big for the size of the cache I need.
I would like to try storing the same data in a MySQL table, to see if I can save on RAM. While doing so, I store it as:
id timestamp data-string
4B 4B
<---- PK ---->
I understand that MySQL will load the index of the InnoDB table (that's what I have now) into RAM. Therefore, the id (unique), the timestamp and a pointer to the data-string will reside in RAM.
How do I calculate the complete RAM usage (ie including the meta-data) for the B+Tree of MySQL only for this new table?
There are so many variables, padding, etc., that it is impractical to estimate how much disk space an InnoDB B+Tree will consume. The 2x you quote is pretty good. The buffer pool is a cache in RAM, so you can't say that the B+Tree will consume as much RAM as it does disk. Caching is done in 16 KB blocks.
(@ec5 has good info on the current size, on disk, of the index(es).)
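If you want to measure the in-RAM footprint rather than estimate it, a rough sketch (the table name is a placeholder; note that querying INNODB_BUFFER_PAGE can itself be expensive on a busy server):
SELECT COUNT(*) * 16384 AS bytes_in_buffer_pool
FROM information_schema.INNODB_BUFFER_PAGE
WHERE table_name LIKE '%your_table%' AND index_name = 'PRIMARY';
This counts the default 16 KB pages of the primary-key B+Tree currently sitting in the buffer pool.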

mysql database has a much larger size than its tables

I have a database that contains many tables, all of which store text. The total size of the tables is less than 10 MB, but the database files on the server are larger than 1 GB.
The tables use MyISAM. Where is the problem?
Thanks
You can use OPTIMIZE TABLE to reduce disk usage on MyISAM tables (usually after deleting a lot of data from them or making major changes to their structure).
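A minimal sketch (the table name is a placeholder); the Data_free column of SHOW TABLE STATUS shows how much reclaimable space a MyISAM table is carrying:
SHOW TABLE STATUS LIKE 'your_table';
OPTIMIZE TABLE your_table;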
For InnoDB that is quite a bit more challenging; see Howto: Clean a mysql InnoDB storage engine?

Handling of huge Blobs in MySQL?

How can I insert huge BLOBs into a MySQL database (InnoDB)?
Fields of type LONGBLOB support data sizes of up to 4 GB according to the MySQL manual. But how does data of such a huge size get into the database?
I tried to use
INSERT INTO table (bindata) VALUES ( LOAD_FILE('c:/tmp/hugefile') );
which fails if the size of hugefile is bigger than about 500 MB. I have set max_allowed_packet to an appropriate size; the value of innodb_buffer_pool_size doesn't seem to have an influence.
My server machine runs Windows Server 2003 and has 2 GB RAM. I'm using MySQL 5.0.74-enterprise-nt.
BLOBs are cached in memory; that's why you will have three copies of a BLOB while you are inserting it into the database.
Your 500 MB BLOB occupies 1,500 MB in RAM, which seems to hit your memory limit.
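A workaround that is sometimes suggested, so that no single statement has to carry the whole file, is to load the BLOB in slices; a rough sketch (table and column names are placeholders, and the hex literals stand in for file chunks read by client-side code, each kept well below max_allowed_packet):
INSERT INTO mytable (id, bindata) VALUES (1, x'89504E47');
UPDATE mytable SET bindata = CONCAT(bindata, x'0D0A1A0A') WHERE id = 1;
Each following slice is appended with another UPDATE in a client-side loop, at the cost of the server rewriting the growing BLOB on every append.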
I do not know which client/API you use, but when trying to use BLOBs from my own Java and Objective-C clients, it seems MySQL does not really support streaming of BLOBs. You need enough memory to hold the whole BLOB as a byte array in RAM (on the server and client side), more than once! Moving to 64-bit Linux helps, but is not the desired solution.
MySQL is not made for BLOB handling (it's OK for small BLOBs :-). It occupies two or three times the RAM of the BLOB being stored/read.
You have to use another database like PostgreSQL to get real BLOB support, sorry.