mysql json vs mongo - storage space

I am experiencing an interesting situation and although it is not an actual problem, I can't understand why it is happening.
We had a mongo database, consisting mainly of some bulk data stored in an array. Because over 90% of the team was familiar with mysql while only a few of us were familiar with mongo, and because it is not a critical db and all queries are done over 2 of the fields (client or product), we decided to move the data into mysql, in a table like this
[idProduct (bigint unsigned), idClient (bigint unsigned), data (json)]
Where data is a huge json containing hundreds of attributes and their values.
We also partitioned it into 100 partitions by a hash over idClient.
PARTITION BY HASH(idClient)
PARTITIONS 100;
All is working fine but I noticed an interesting fact:
The original mongo db had about 70 GB, give or take. The mysql version (containing actually less data, because we removed some duplicates that we were using as indexes in mongo) has over 400 GB.
Why does it take so much more space? In theory bson should actually be slightly larger than json (at least in most cases). Even if indexes are larger in mysql... the difference is huge (over 5x).

I did a presentation, How to Use JSON in MySQL Wrong (video), in which I imported the Stack Overflow data dump into JSON columns in MySQL. I found the data I tested with took 2x to 3x more space than importing the same data into normal tables and columns using conventional data types for each column.
JSON uses more space for the same data, for example because it stores integers and dates as strings, and also because it stores key names on every row, instead of just once in the table header.
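As a rough illustration (this is a made-up schema, not the one from the question), compare typed columns with the equivalent JSON document; in the JSON version every row repeats the key names and stores the date as a string:
-- Conventional columns: fixed-size binary values, key names exist only in the table definition.
CREATE TABLE attrs_columns (
  idProduct  BIGINT UNSIGNED NOT NULL,
  idClient   BIGINT UNSIGNED NOT NULL,
  weight_kg  DECIMAL(8,3),
  created_at DATETIME,
  PRIMARY KEY (idProduct, idClient)
);
-- JSON column: every row carries its own copy of the attribute names.
CREATE TABLE attrs_json (
  idProduct BIGINT UNSIGNED NOT NULL,
  idClient  BIGINT UNSIGNED NOT NULL,
  data      JSON,
  PRIMARY KEY (idProduct, idClient)
);
INSERT INTO attrs_json VALUES
  (1, 1, '{"weight_kg": 1.25, "created_at": "2024-01-01 00:00:00"}');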
That's comparing JSON in MySQL vs. normal columns in MySQL. I'm not sure how MongoDB stores data and why it's so much smaller. I have read that MongoDB's WiredTiger engine supports options for compression, and snappy compression is enabled by default since MongoDB 3.0. Maybe you should enable compressed format in MySQL and see if that gives you better storage efficiency.
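If you want to try that, a minimal sketch looks like this, assuming InnoDB with innodb_file_per_table enabled; the table name is hypothetical and KEY_BLOCK_SIZE is just a starting value to experiment with:
-- Compress 16KB pages into 8KB blocks; compare table size on disk before and after.
ALTER TABLE product_data
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;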
JSON in MySQL is stored like TEXT/BLOB data, in that it gets mapped into a set of 16KB pages. Pages are allocated one at a time for the first 32 pages (that is, up to 512KB). If the content is longer than that, further allocation is done in increments of 64 pages (1MB). So it's possible if a single TEXT/BLOB/JSON content is say, 513KB, it would allocate 1.5MB.

Hi, I think the main reason could be that internally Mongo stores JSON as BSON (http://bsonspec.org/), and the spec stresses that this representation is lightweight.

The WiredTiger Storage Engine in MongoDB uses compression by default. I don't know the default behavior of MySQL.
Unlike MySQL, MongoDB is designed to store JSON/BSON; in fact, it does not store anything else. So this kind of "competition" might be a bit unfair for MySQL, which stores JSON like TEXT/BLOB data.
If you had relational data, i.e. column-based values, then most likely MySQL would be smaller, as stated by Bill Karwin. However, with smart bucketing in MongoDB you can reduce the data size significantly.

Related

Best use of database for storing large scientific data sets

In my primary role, I handle laboratory testing data files that can contain upwards of 2000 parameters for each unique test condition. These files are generally stored and processed as CSV formatted files, but that becomes very unwieldy when working with 6000+ files with 100+ rows each.
I am working towards a future database storage and query solution to improve access and efficiency, but I am stymied by the row length limitation of MySQL (specifically MariaDB 5.5.60 on RHEL 7.5). I am using MyISAM instead of InnoDB, which has allowed me to get to around 1800 mostly double-formatted data fields. This version of MariaDB forces dynamic columns to be numbered, not named, and I cannot currently upgrade to MariaDB 10+ due to administrative policies.
Should I be looking at a NoSQL database for this application, or is there a better way to handle this data? How do others handle many-variable data sets, especially numeric data?
For an example of the CSV files I am trying to import, see below. The identifier I have been using is an amalgamation of TEST, RUN, TP forming a 12-digit unsigned bigint key.
Example File:
RUN ,TP ,TEST ,ANGLE ,SPEED ,...
1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04,...
Example key:
548000010001 <-- Test = 5480, Run = 1, TP = 1
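That is, the key packs RUN and TP into fixed 4-digit slots after TEST, so it can be computed as:
SELECT 5480 * 100000000 + 1 * 10000 + 1 AS example_key;   -- 548000010001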
I appreciate any input you have.
The complexity comes from the fact that you have to handle a huge amount of data, not from the fact that it is split over many files with many rows.
Using a database storage & query system will superficially hide some of this complexity, but at the expense of complexity at several other levels, as you have already experienced, including obstacles that are out of your control like changing versions and conservative admins. Database storage & query systems are made for other application scenarios where they have advantages that are not pertinent to your case.
You should seriously consider leaving your data in files, i.e. using your file system as your database storage system. Possibly, transcribe your CSV input into a modern self-documenting data format like YAML or HDF5. For queries, you may be better off writing scripts or programs that directly access those files, instead of writing SQL queries.

Why do we have Redis when we have MySQL temporary tables?

MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL was set up accordingly). One can create indexes on any fields.
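For example, something like this (the names are just placeholders):
CREATE TEMPORARY TABLE volatile_rows (
  id    BIGINT UNSIGNED NOT NULL,
  label VARCHAR(64),
  val   INT,
  PRIMARY KEY (id),
  INDEX idx_label (label),
  INDEX idx_val (val)
) ENGINE=MEMORY;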
Redis stores data in memory indexed by one key at a time, and in my understanding MySQL can do this job too.
Are there any things that make Redis better for storing a big amount (100-200k rows) of volatile data? The only explanation I can find for the appearance of Redis is that not every project has mysql inside, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to set up Redis as well?
Redis is like working with indexes directly. There's no ACID, SQL parser and many other things between you and the data.
It provides some basic data structures and they're specifically optimized to be held in memory, and they also have specific operations to read and modify them.
On the other hand, Redis isn't designed to query data (though you can implement very powerful and high-performance filters with SORT, SCAN, intersections and other operations), but to store the data in the form it is going to be consumed later. If you want to get, for example, customers sorted by 3 different criteria, you'll need to fill 3 different sorted sets. There are a lot of use cases with other data structures, but I would end up writing a book in an answer...
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you should use Redis instead of temporary tables in MySQL (and other engines which have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and gives you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.

In memory relational database

I know this question has been asked multiple times on stackoverflow.
I am posting this question to find out what will be the best choice for my design.
I have the following schema for my job details.
_unique_key varchar(256) NULL
_job_handle varchar(256) NULL
_data varchar(1024) NULL
_user_id int(11) NULL
_server_ip varchar(39) NULL
_app_version varchar(256) NULL
_state int(11) NULL
_is_set_stopped bool
What operations we are doing on this table:
For each job we will have one update and 10 select queries on this table, so we need high-frequency reads and writes.
There are many applications manipulating this table by filtering on:
_unique_key
_state
is_set_stopped
_user_id
The _data field size varies from 5KB to 1MB based on the type of application and user.
Applications can update selected attributes.
Solutions we considered:
MySQL InnoDB
I think MySQL will not scale well enough due to the requirement for high read and write rates.
MySQL In Memory Table
The problems with this solution are:
It doesn't support dynamic field size. MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. Source http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html
SELECT ... FOR UPDATE will lock the entire table. I don't know whether that will be a problem.
Redis
Redis looks like a good choice, but I think my table is not a good fit for a key-value cache server.
It supports only a very limited set of datatypes; I can store only strings in a list, so I would need to store fields as JSON or some other format.
If clients want to update a particular attribute, they need to download the full value, parse the object and push it back to the server. Maybe I am wrong; is there a way to do that?
Filtering based on a value will not be possible. Maybe I am wrong; is there a way to do that?
MySQL InnoDB on TMPFS file system
This looks promising, but I don't know whether it will scale as well as Redis or a MySQL in-memory table.
In this question, you are confusing raw performance (i.e. efficiency) with scalability. They are different concepts.
Between the InnoDB and memory engines, InnoDB is likely to be the more scalable. InnoDB supports multi-versioning concurrency control and has plenty of optimizations to deal with contention, so it will handle concurrent accesses much better than the memory engine, even if it may be slower in some I/O-bound situations.
Redis is a single-threaded server. All the operations are serialized. It has zero scalability. That does not mean it is inefficient. On the contrary, it will likely support more connections than MySQL (due to its epoll-based event loop) and more traffic (due to its very efficient lock-free implementation and in-memory data structures).
To answer your question, I would give MySQL with InnoDB a try. If it is properly configured (no synchronous commit, enough buffer cache, etc.), it can sustain a good throughput. And instead of running it on top of tmpfs, I would consider SSD hardware.
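For instance, the kind of settings meant by "properly configured" (the values are only illustrative):
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- relax synchronous commit: flush the log roughly once per second
-- in my.cnf (requires a restart on older MySQL versions):
-- innodb_buffer_pool_size = 8G                 -- large enough to keep the working set in memory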
Now, if you prefer to use Redis (which is not a relational store btw), you can certainly do it. There is no need to systematically serialize/deserialize your data. And filtering is indeed possible, provided you can anticipate all access paths and find an adapted data structure.
For instance:
one hash object per job. The key is _unique_key. The fields of the hash should correspond to the columns of your relational table.
one set per state value
2 sets for is_set_stopped
one set per userid value
For each job insertion, you need to pipeline the following commands:
HMSET job:AAA job_handle BBB data CCC user_id DDD server_ip EEE app_version FFF state GGG is_set_stopped HHH
SADD state:GGG AAA
SADD is_set_stopped:HHH AAA
SADD user_id:DDD AAA
You can easily update any field individually provided you maintain the corresponding sets.
You can perform filtering queries by intersecting the sets. For instance:
SINTER is_set_stopped:HHH state:GGG
With Redis, the bottleneck will likely be the network, especially if the data field is large. I hope you will have more jobs of 5KB than jobs of 1MB. For instance 1000 write/s of 1 MB objects represents 8 GBits/s, probably more than what your network can sustain. This is true for both Redis and MySQL.
I suggest postgresql: it's more capable than mysql (it has more features and better support for complex queries and datatypes) and has a lot of tuning options.
If you give postgresql enough memory and tune the parameters right it will cache everything in memory.
Alternatively you could also run it on tmpfs if that's your preference, and use streaming replication to an on-disk database for a hard copy.
Streaming replication has 3 operating modes: asynchronous, on receive, and on fsync. If you use the first one, async, you don't have to wait for a sync to disk on the replication server, so any updates will be very fast with tmpfs.
Since you also seem to have a lot of text fields, another feature might help: postgresql can store a text-search vector on a row, and you can add an index on that and update it via a trigger with the concatenated content of all the columns you are searching on. That will give you an incredible boost in performance when doing text search on multiple columns versus any way you can possibly write that in mysql.
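A minimal sketch of that approach, assuming a table named jobs with the columns from your schema (the table, column and index names added here are hypothetical):
ALTER TABLE jobs ADD COLUMN search_tsv tsvector;

CREATE INDEX jobs_search_idx ON jobs USING gin (search_tsv);

-- keep search_tsv up to date from the text columns being searched
CREATE TRIGGER jobs_tsv_update
  BEFORE INSERT OR UPDATE ON jobs
  FOR EACH ROW
  EXECUTE PROCEDURE tsvector_update_trigger(search_tsv, 'pg_catalog.english', _job_handle, _data);

-- example query against the combined text of both columns
SELECT * FROM jobs WHERE search_tsv @@ to_tsquery('english', 'foo & bar');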
Regardless of database you use:
You state that _data is varchar(1024), yet you say it contains 5KB to 1MB of data? Is this actually a blob? Even if the length was a mistake, mysql doesn't support varchar fields longer than 65535 bytes. Since that field is presumably not updated as often as the other columns, it might be wise to separate this into two tables, one with the static data and one with the dynamic data, to minimize disk access.
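A sketch of such a split, in postgresql syntax with hypothetical table names (the columns come from your schema):
CREATE TABLE job_static (
  _unique_key     VARCHAR(256) PRIMARY KEY,
  _job_handle     VARCHAR(256),
  _user_id        INT,
  _server_ip      VARCHAR(39),
  _app_version    VARCHAR(256),
  _state          INT,
  _is_set_stopped BOOLEAN
);

CREATE TABLE job_payload (
  _unique_key VARCHAR(256) PRIMARY KEY REFERENCES job_static (_unique_key),
  _data       TEXT   -- TEXT instead of varchar(1024), since payloads reach 1 MB
);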

storing telemetry data from 10000s of nodes

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be.
Each node has a unique ID, and there will be a timestamp for each packet of variables (which will probably need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
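Something like this is what I have in mind; the variable columns below are just placeholders for whatever the packet contains:
CREATE TABLE telemetry (
  device_id   BIGINT UNSIGNED NOT NULL,
  recorded_at TIMESTAMP       NOT NULL,
  var1        FLOAT,
  var2        FLOAT,
  -- ...one column per variable in the 64-byte packet
  PRIMARY KEY (device_id, recorded_at)
);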
The size of each telemetry packet is 64 bytes, including the device ID and timestamp, so around 100GB+ per year.
I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards it. If I were to go for MySQL, would it make sense to have a separate table for each device ID? Would this make queries much faster, or would having 10000s of tables be a problem?
I don't think querying the variables from all devices in one go is going to be needed but it might be. Or should I just stick it all in a single table and use MySQL cluster if it gets really big?
Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas? If anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
HBase time-series databases:
OpenTSDB
KairosDB
Axibase Time-Series Database - my affiliation
If you want to go with MySQL, keep in mind that although it will easily keep up with something like 100GB per year on modern hardware, you cannot execute schema changes afterwards on a live system. This means you'll have to have a good, complete database schema to begin with.
I don't know if this telemetry data might grow more features, but if it does, you don't want to have to lock your database for hours when you need to add a column or index.
However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays which make such changes somewhat easier. No performance problems to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online, and sometimes is somewhat smarter about the use of indexes. (For example, http://kb.askmonty.org/en/index-condition-pushdown is a new trick for MySQL/MariaDB, and allows you to combine two indices at query time. PostgreSQL has been doing this for a long time.)
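To illustrate the online schema-change point (table, column and index names here are hypothetical): in PostgreSQL, adding a nullable column without a default is a quick catalog-only change, and an index can be built without blocking writes:
ALTER TABLE telemetry ADD COLUMN extra_sensor REAL;
CREATE INDEX CONCURRENTLY telemetry_extra_idx ON telemetry (extra_sensor);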
Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.

Postgresql vs. MySQL: how do their data sizes compare to each other?

For the same data set, with mostly text data, how does the data (table + index) size of Postgresql compare to that of MySQL?
Postgresql uses MVCC, which would suggest its data size would be bigger.
In this presentation, the largest blog site in Japan talked about their migration from Postgresql to MySQL. One of their reasons for moving away from Postgresql was that data size in Postgresql was too large (p. 41):
Migrating from PostgreSQL to MySQL at Cocolog, Japan's Largest Blog Community
Postgresql has data compression, so that should make the data size smaller. But the MySQL InnoDB plugin also has compression.
Does anyone have any actual experience about how the data sizes of Postgresql & MySQL compare to each other?
MySQL uses MVCC as well, just check InnoDB. But in PostgreSQL you can change the FILLFACTOR to make space for future updates. With this, you can create a database that has space for the current data but also for some future updates and deletes. When autovacuum and HOT do their things right, the size of your database can be stable.
The blog post is about old versions; a lot of things have changed and PostgreSQL does a much better job at compression than it did in the old days.
Compression depends on the datatype, configuration and speed as well. You have to test to see how it works for your situation.
I did a couple of conversions from MySQL to PostgreSQL, and in all these cases PostgreSQL was about 10% smaller (MySQL 5.0 => PostgreSQL 8.3 and 8.4). That 10% was used to change the fillfactor on the most-updated tables: these were set to a fillfactor of 60 to 70. Speed was much better (no more problems with over 20 concurrent users) and data size was stable as well, with no MVCC going out of control or vacuum falling too far behind.
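For reference, setting the fillfactor looks like this (the table name is hypothetical); existing pages are only repacked after a rewrite such as VACUUM FULL or CLUSTER:
ALTER TABLE blog_entries SET (fillfactor = 70);
VACUUM FULL blog_entries;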
MySQL and PostgreSQL are two different beasts: PostgreSQL is all about reliability, whereas MySQL is popular.
Both have their storage requirements in their respective documentation:
MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Postgres: http://www.postgresql.org/docs/current/interactive/datatype.html
A quick comparison of the two doesn't show any flagrant "zomg Postgres requires 2 megabytes to store a bit field" type differences. I suppose Postgres could have higher metadata overhead than MySQL, or has to extend its data files in larger chunks, but I can't find anything obvious suggesting that Postgres "wastes" space for which migrating to MySQL would be the cure.
I'd like to add that for large column values, postgresql also takes advantage of compressing them using a "fairly simple and very fast member of the LZ family of compression techniques".
To read more about this, check out http://www.postgresql.org/docs/9.0/static/storage-toast.html
It's rather low-level and probably not necessary to know, but since you're using a blog, you may benefit from it.
About indexes,
MySQL (InnoDB) stores the row data within the clustered index, which makes its indexes huge. Postgres doesn't. This means that the storage size of a b-tree index in Postgres depends only on the columns it covers, not on the width or data types of the rest of the row.
Postgres also supports partial indexes (e.g. WHERE status=0), which is a very powerful feature to prevent building indexes over millions of rows when only a few hundred are needed.
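For example (table and column names are hypothetical):
CREATE INDEX orders_unprocessed_idx ON orders (id) WHERE status = 0;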
Since you're going to put a lot of data in Postgres you will probably find it practical to be able to create indexes without locking the table.
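That is, something along the lines of the following, which builds the index without blocking writes (names are hypothetical):
CREATE INDEX CONCURRENTLY posts_author_idx ON posts (author_id);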
Sent from my iPhone. Sorry for bad spelling and lack of references