Best use of database for storing large scientific data sets - mysql

In my primary role, I handle laboratory testing data files that can contain upwards of 2000 parameters for each unique test condition. These files are generally stored and processed as CSV formatted files, but that becomes very unwieldy when working with 6000+ files with 100+ rows each.
I am working towards a future database storage and query solution to improve access and efficiency, but I am stymied by the row length limitation of MySQL (specifically MariaDB 5.5.60 on RHEL 7.5). I am using MyISAM instead of InnoDB, which has allowed me to get to around 1800 data fields, most of them doubles. This version of MariaDB forces dynamic columns to be numbered, not named, and I cannot currently upgrade to MariaDB 10+ due to administrative policies.
Should I be looking at a NoSQL database for this application, or is there a better way to handle this data? How do others handle many-variable data sets, especially numeric data?
For an example of the CSV files I am trying to import, see below. The identifier I have been using is an amalgamation of TEST, RUN, TP forming a 12-digit unsigned bigint key.
Example File:
RUN ,TP ,TEST ,ANGLE ,SPEED ,...
1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04,...
Example key:
548000010001 <-- Test = 5480, Run = 1, TP = 1
I appreciate any input you have.
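The key scheme above can be sketched as follows. This is a minimal sketch that assumes each of TEST, RUN and TP occupies a fixed four-digit field, which is inferred from the single example key rather than stated anywhere; `make_key` and `split_key` are hypothetical helper names.

```python
FIELD_WIDTH = 10_000  # assumption: four decimal digits per field

def make_key(test, run, tp):
    """Pack TEST, RUN and TP into one integer key laid out as TTTTRRRRPPPP."""
    # The CSV stores integers in scientific notation, so go through float first.
    parts = [int(float(value)) for value in (test, run, tp)]
    if any(not 0 <= part < FIELD_WIDTH for part in parts):
        raise ValueError("each field must fit in four digits")
    test, run, tp = parts
    return (test * FIELD_WIDTH + run) * FIELD_WIDTH + tp

def split_key(key):
    """Recover (TEST, RUN, TP) from a key built by make_key."""
    rest, tp = divmod(key, FIELD_WIDTH)
    test, run = divmod(rest, FIELD_WIDTH)
    return test, run, tp

# Values taken from the example file row above.
key = make_key("5.480000E+03", "1.000000E+00", "1.000000E+00")
```

Keeping the three fields recoverable from the key means a single indexed column can still be filtered by test, run or test point.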

The complexity comes from the fact that you have to handle a huge amount of data, not from the fact that it is split over many files with many rows.
Using a database storage and query system will superficially hide some of this complexity, but at the expense of complexity at several other levels, as you have already experienced, including obstacles that are out of your control, such as version constraints and conservative admins. Database storage and query systems are made for other application scenarios, where they have advantages that are not pertinent to your case.
You should seriously consider keeping your data in files, i.e. using your file system as your database storage system. Possibly transcribe your CSV input into a modern self-documenting data format like YAML or HDF5. For queries, you may be better off writing scripts or programs that directly access those files, instead of writing SQL queries.
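A script that queries the files directly could look roughly like this. It is only a sketch using Python's standard `csv` module; `matching_rows` and `scan_directory` are hypothetical names, the directory layout is an assumption, and the second sample row is invented for illustration.

```python
import csv
from pathlib import Path

def matching_rows(lines, column, predicate):
    """Yield each row (as a dict) whose numeric `column` value satisfies `predicate`."""
    reader = csv.reader(lines)
    # Header names in the sample file are space-padded, so strip them.
    header = [name.strip() for name in next(reader, [])]
    for values in reader:
        row = dict(zip(header, (value.strip() for value in values)))
        if column in row and predicate(float(row[column])):
            yield row

def scan_directory(data_dir, column, predicate):
    """Apply matching_rows to every *.csv file in `data_dir` (assumed layout)."""
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in matching_rows(handle, column, predicate):
                yield path.name, row

sample = [
    "RUN ,TP ,TEST ,ANGLE ,SPEED",
    "1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04",
    "1.000000E+00,2.000000E+00,5.480000E+03,2.500000E+01,6.100000E+04",
]
steep = list(matching_rows(sample, "ANGLE", lambda angle: angle > 20.0))
```

Because the filter is an ordinary function, the same loop serves both fixed reports and one-off investigations without any schema.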

Related

mysql json vs mongo - storage space

I am experiencing an interesting situation and, although it is not an actual problem, I can't understand why this is happening.
We had a mongo database, consisting mainly of some bulk data stored in an array. Because over 90% of the team was familiar with mysql while only a few of us were familiar with mongo, and because it is not a critical db and all queries are done over 2 of the fields (client or product), we decided to move the data into mysql, in a table like this:
[idProduct (bigint unsigned), idClient (bigint unsigned), data (json)]
Where data is a huge json containing hundreds of attributes and their values.
We also partitioned in 100 partitions by a hash over idClient.
PARTITION BY HASH(idClient)
PARTITIONS 100;
All is working fine but I noticed an interesting fact:
The original mongo db had about 70 GB, give or take. The mysql version (which actually contains less data, because we removed some duplicates that we were using as indexes in mongo) has over 400 GB.
Why does it take so much more space? In theory bson should actually be slightly larger than json (at least in most cases). Even if indexes are larger in mysql... the difference is huge (over 5x).
I did a presentation, How to Use JSON in MySQL Wrong (video), in which I imported the Stack Overflow data dump into JSON columns in MySQL. I found that the data I tested with took 2x to 3x more space than importing the same data into normal tables and columns using conventional data types for each column.
JSON uses more space for the same data, for example because it stores integers and dates as strings, and also because it stores key names on every row, instead of just once in the table header.
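The effect of repeating key names on every row can be illustrated with a rough size comparison. This sketch uses textual JSON from Python's `json` module and fixed-width binary values from `struct`; it is not MySQL's internal JSON storage format, and the three attributes are hypothetical.

```python
import json
import struct

# Hypothetical rows: three integer attributes per record.
rows = [{"idProduct": i, "idClient": i % 100, "quantity": i * 3} for i in range(1000)]

# Document style: every record carries its own key names and textual numbers.
json_bytes = sum(len(json.dumps(row).encode("utf-8")) for row in rows)

# Column style: key names live in the schema once, values are 8-byte integers.
packed_bytes = sum(len(struct.pack("<QQq", *row.values())) for row in rows)

ratio = json_bytes / packed_bytes
```

The exact ratio depends on key lengths and value ranges, so it should be read as an illustration of the mechanism rather than a prediction for any particular table.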
That's comparing JSON in MySQL vs. normal columns in MySQL. I'm not sure how MongoDB stores data and why it's so much smaller. I have read that MongoDB's WiredTiger engine supports options for compression, and snappy compression is enabled by default since MongoDB 3.0. Maybe you should enable compressed format in MySQL and see if that gives you better storage efficiency.
JSON in MySQL is stored like TEXT/BLOB data, in that it gets mapped into a set of 16KB pages. Pages are allocated one at a time for the first 32 pages (that is, up to 512KB). If the content is longer than that, further allocation is done in increments of 64 pages (1MB). So it's possible if a single TEXT/BLOB/JSON content is say, 513KB, it would allocate 1.5MB.
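The allocation rule described above works out as follows. This is a minimal sketch of the rule exactly as stated in this answer (16KB pages, 32 single pages, then 64-page steps); it has not been checked against the storage engine itself, and `allocated_kb` is a hypothetical helper.

```python
import math

PAGE_KB = 16        # page size stated in the answer
SINGLE_PAGES = 32   # pages allocated one at a time (up to 512KB)
EXTENT_PAGES = 64   # later allocations come in 64-page (1MB) steps

def allocated_kb(content_kb):
    """Space allocated for a single value of `content_kb` kilobytes."""
    pages = math.ceil(content_kb / PAGE_KB)
    if pages <= SINGLE_PAGES:
        return pages * PAGE_KB
    extents = math.ceil((pages - SINGLE_PAGES) / EXTENT_PAGES)
    return (SINGLE_PAGES + extents * EXTENT_PAGES) * PAGE_KB

# 513KB needs 33 pages: 32 single pages plus one full 64-page step.
example = allocated_kb(513)  # 1536 KB, i.e. 1.5MB
```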
I think the main reason could be that Mongo internally stores JSON as BSON (http://bsonspec.org/), and the spec stresses that this representation is lightweight.
The WiredTiger Storage Engine in MongoDB uses compression by default. I don't know the default behavior of MySQL.
Unlike MySQL, MongoDB is designed to store JSON/BSON; in fact, it does not store anything else. So this kind of "competition" might be a bit unfair to MySQL, which stores JSON like TEXT/BLOB data.
If you had relational data, i.e. column-based values, then most likely MySQL would be smaller, as stated by @Bill Karwin. However, with smart bucketing in MongoDB you can reduce the data size significantly.

CSV vs database?

I’ve read a few csv vs database debates, and in many cases people recommended a db solution over csv.
However it has never been exactly the same setup I have.
So here is the setup.
- Every hour around 50 csv files generated representing performance group from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily for certain counters and nodes) - this should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes) based on variable time period, host, counter
- In total around 100MB a day (in all 50 files)
Possible solutions?
1) Keep it in csv
- Create a master csv file for each performance group and every hour just append the latest csv file
- Generate my reports using just scripts with shell commands (grep, sed, cut, awk)
2) Load it to database (e.g. MySQL)
- Create tables mirroring performance groups and load those csv files into the tables
- Generate my reports using sql queries
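Option 2 can be tried out cheaply before committing to a server. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, so it says nothing about MySQL performance; the table name, column names and sample values are all made up for illustration.

```python
import csv
import sqlite3

# Made-up sample of one performance group; real files would have 20 to 100 counters.
SAMPLE = """ts,host,cpu_user,cpu_system
2024-01-01 00:00,host01,12.5,3.0
2024-01-01 01:00,host01,20.0,4.5
2024-01-01 00:00,host02,50.0,9.0
"""

def load_group(conn, lines):
    """Create the table on first use and append the rows of one CSV file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cpu "
        "(ts TEXT, host TEXT, cpu_user REAL, cpu_system REAL)"
    )
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO cpu VALUES (?, ?, ?, ?)", reader)

def daily_average(conn, counter):
    """Predefined report: per-host daily mean of one counter."""
    if counter not in ("cpu_user", "cpu_system"):
        raise ValueError("unknown counter")
    return conn.execute(
        f"SELECT host, substr(ts, 1, 10) AS day, AVG({counter}) "
        "FROM cpu GROUP BY host, day ORDER BY host, day"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load_group(conn, SAMPLE.splitlines())
report = daily_average(conn, "cpu_user")
```

Timing the same report against a day of real files with both approaches would answer the speed question more reliably than general advice.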
When I tried to simulate this using just shell commands on csv files, it was very fast.
I worry that database queries would be slower (considering the amount of data).
I also know that databases don’t like too wide tables – in my scenario I would need in some cases 100+ columns.
It will be read only for most of time (only appending new files).
I’d like to keep data for a year, so it would be around 36GB. Would the database solution still perform ok (1-2 core VM, 2-4GB memory expected)?
I haven’t simulated the database solution, which is why I’d like to ask whether you have any view on or experience with a similar scenario.

storing telemetry data from 10000s of nodes

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be.
Each node has a unique ID, and there will be a timestamp for each packet of variables. (probably will need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
The size of each telemetry packet is 64 bytes, including the device ID and timestamp. So around 100 GB+ per year.
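The yearly volume can be estimated as below. The reporting interval is an assumption, since "every few minutes" is not pinned down; a 3-minute interval reproduces a figure a little above 100 GB, and `yearly_bytes` is a hypothetical helper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def yearly_bytes(nodes, packet_bytes, interval_minutes):
    """Raw payload per year, ignoring any storage overhead."""
    packets_per_node = MINUTES_PER_YEAR // interval_minutes
    return nodes * packets_per_node * packet_bytes

# 10000 nodes, 64-byte packets, assumed 3-minute interval.
estimate_gb = yearly_bytes(10_000, 64, 3) / 1e9   # about 112 GB
slower_gb = yearly_bytes(10_000, 64, 5) / 1e9     # about 67 GB at 5 minutes
```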
I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards this. If I were to go for MySQL would it make sense to have a separate table for each device ID? - Would this make queries much faster or would having 10000s of tables be a problem?
I don't think querying the variables from all devices in one go is going to be needed but it might be. Or should I just stick it all in a single table and use MySQL cluster if it gets really big?
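The single-table layout with a (device ID, timestamp) key could look roughly like this. It is a sketch using Python's built-in `sqlite3` as a stand-in for MySQL; the `temp` column and the sample readings are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE telemetry (
        node_id INTEGER NOT NULL,
        ts      INTEGER NOT NULL,   -- server-generated Unix timestamp
        temp    REAL,               -- hypothetical variable
        PRIMARY KEY (node_id, ts)
    )
""")
# Made-up readings: three from node 1, one from node 2.
rows = [(1, 1000, 20.5), (1, 1300, 21.0), (1, 1600, 21.5), (2, 1000, 18.0)]
conn.executemany("INSERT INTO telemetry VALUES (?, ?, ?)", rows)

def variable_over_range(conn, node_id, start, end):
    """One variable for one node across a time range, served by the composite key."""
    return conn.execute(
        "SELECT ts, temp FROM telemetry "
        "WHERE node_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (node_id, start, end),
    ).fetchall()

window = variable_over_range(conn, 1, 1000, 1300)
```

With the node ID leading the key, a per-device range query touches only that device's rows, which is the main benefit a table per device would otherwise provide.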
Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas? If anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
HBase time-series databases:
- OpenTSDB
- KairosDB
- Axibase Time-Series Database (my affiliation)
If you want to go with MySQL, keep in mind that, although it will easily cope with something like 100 GB per year on modern hardware, you cannot execute schema changes afterwards (on a live system). This means you'll have to have a good, complete database schema to begin with.
I don't know if this telemetry data might grow more features, but if they do, you don't want to have to lock your database for hours if you need to add a column or index.
However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays which make such changes somewhat easier. No performance problems to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online, and sometimes is somewhat smarter about the use of indexes. (For example, http://kb.askmonty.org/en/index-condition-pushdown is a new trick for MySQL/MariaDB, and allows you to combine two indices at query time. PostgreSQL has been doing this for a long time.)
Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a mysql server to store some data but realized (after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60k to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
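The "over 10 hours" estimate follows directly from the quoted insert rates. A quick check, with `load_hours` as a hypothetical helper:

```python
def load_hours(rows, rows_per_second):
    """Hours needed to insert `rows` at a sustained rate."""
    return rows / rows_per_second / 3600

TOTAL_ROWS = 5_000_000_000

best_case = load_hours(TOTAL_ROWS, 100_000)   # roughly 13.9 hours
worst_case = load_hours(TOTAL_ROWS, 60_000)   # roughly 23.1 hours
```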
What can I do? I have 8 servers at my disposal (in addition to the database server). Can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow merge all the separated data into one server?
I was going to use mysql with innodb (I can use any other settings if that helps), but it's not finalized. If mysql doesn't work, is there something else that will? I have used hbase before, but was looking for a mysql solution first because it seems more widely used and easier to get help with if I have problems.
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use the MEMORY storage engine for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
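The disable, bulk-load, re-enable sequence can be sketched as a list of statements to send to the server. The table name and file path here are hypothetical, and note that `ALTER TABLE ... DISABLE KEYS` only affects non-unique indexes on MyISAM tables, so this fits the MyISAM suggestion above rather than InnoDB.

```python
def bulk_load_statements(table, server_path):
    """Return the SQL statements for a disable-load-enable cycle, in order."""
    if not table.isidentifier():
        raise ValueError("unexpected table name")
    # Escape backslashes and quotes so the path is a valid SQL string literal.
    escaped = server_path.replace("\\", "\\\\").replace("'", "\\'")
    return [
        f"ALTER TABLE {table} DISABLE KEYS",
        f"LOAD DATA INFILE '{escaped}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
        f"ALTER TABLE {table} ENABLE KEYS",
    ]

# Hypothetical per-day table and a file path on the database server itself.
statements = bulk_load_statements("rows_2024_01_01", "/var/lib/mysql-files/batch.csv")
```

The statements would be executed one after another through whatever MySQL client library is in use; the file must already exist on the server's file system when the second statement runs.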

Best way to store large data in mysql

I need to store large amount of data every hour in the database. What kind of data? Text data.
What is the best way? Store on multiple table or 1 large table?
Edit: I just said, large text data. 10000 times the word "data"
Every hour a new line is added like:
hour - data
Edit 2: Just because you can't understand the question does not mean it's not a readable question. I also said "EVERY HOUR", so imagine that every hour for the next 10 years a new line will be created.
Use a column of datatype 'text', 'mediumtext', or 'longtext' according to your needs.
See: http://dev.mysql.com/doc/refman/5.0/en/blob.html
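Which type is needed depends on the byte length of the largest value. A small sketch using MySQL's documented maximum lengths for the TEXT family (`smallest_text_type` is a hypothetical helper):

```python
# Maximum lengths in bytes for MySQL's TEXT types.
TEXT_LIMITS = [
    ("TINYTEXT", 2**8 - 1),
    ("TEXT", 2**16 - 1),
    ("MEDIUMTEXT", 2**24 - 1),
    ("LONGTEXT", 2**32 - 1),
]

def smallest_text_type(value):
    """Name of the smallest TEXT type that can hold `value` encoded as UTF-8."""
    size = len(value.encode("utf-8"))
    for name, limit in TEXT_LIMITS:
        if size <= limit:
            return name
    raise ValueError("value exceeds LONGTEXT capacity")

# The example from the question: the word "data" repeated 10000 times.
hourly_value = "data" * 10_000
column_type = smallest_text_type(hourly_value)  # "TEXT" (40000 bytes)
```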
Alternatively, you could just output the data to a file. Files are more appropriate for logging large amounts of data that may not need to be accessed often, which it seems like this might be.
MySQL added many features in version 5.7, so there are now several ways to do this.
Oracle-style Big Data is now being integrated into MySQL.
MySQL has unlocked new Big Data insights with MySQL & Hadoop.
Solution 1: You can use MySQL as a Document Store. It can store a great many objects as JSON. This is highly recommended and extensible.
MySQL Document Store = (MySQL + NoSQL).
The X DevAPI helps you work with JSON through SQL and CRUD operations over the X Protocol. It is also possible to maintain an X Session.
It is well suited to transparent data sending and sharing for a chat application or group application.
Solution 2: MySQL Sysbench: Read Only is another good solution. It will be very fast and scalable for building a group chat application.
Solution 3: Use MySQL 5.7 with InnoDB and the NoSQL Memcached API, which interacts directly with the InnoDB storage engine. It is 6x faster than MySQL 5.6.
Facebook still uses this technology because it is very fast.
For more details:
https://www.mysql.com/news-and-events/web-seminars/introducing-mysql-document-store/
https://dev.mysql.com/doc/refman/5.7/en/document-store-setting-up.html
About Big Data: https://www.oracle.com/big-data/index.html
https://www.youtube.com/watch?v=1Dk517M-_7o
I think it is better to use a database that is not used by anything other than whatever uses this data (as it is a lot of text data and may slow down SQL queries), and to create separate tables for each category of data.
Ad#m