Efficient and scalable storage for JSON data with NoSQL databases - json

We are working on a project which should collect journal and audit data and store it in a datastore for archive purposes and some views. We are not quite sure which datastore would work for us.
we need to store small JSON documents, about 150 bytes, e.g. "audit:{timestamp: '86346512',host':'foo',username:'bar',task:'foo',result:0}" or "journal:{timestamp:'86346512',host':'foo',terminalid:1,type='bar',rc=0}"
we are expecting about one million entries per day, about 150 MB data
data will be stored and read but never modified
data should stored in an efficient way, e.g. binary format used by Apache Avro
after a retention time data may be deleted
custom queries, such as 'get audit for user and time period' or 'get journal for terminalid and time period'
replicated data base for failsafe
scalable
Currently we are evaluating NoSQL databases like Hadoop/Hbase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best?
Are there better options?

One million inserts / day is about 10 inserts / second. Most databases can deal with this, and its well below the max insertion rate we get from Cassandra on reasonable hardware (50k inserts / sec)
Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely - when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
"data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
custom queries, such as 'get audit for user and time period' - in Cassandra, you would model this by having the row key to be the user id and the column key being the time of the event (most likely a timeuuid). You would then use a get_slice call (or even better CQL) to satisfy this query
or 'get journal for terminalid and time period' - as above, have the row key be terminalid and column key be timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries.
Cassandra has a very sophisticate replication model, where you can specify different consistency levels per operation. Cassandra is also very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame!)
Having said all of this, your requirements could easily be satisfied by a more traditional database and simple master-slave replication, nothing here is too onerous

Avro supports schema evolution and is a good fit for this kind of problem.
If your system does not require low latency data loads, consider receiving the data to files in a reliable file system rather than loading directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic won't ever impact the data collection system.
If you will only have a handful of queries to run, you could leave the files in their native format and write custom map reduces to generate the reports you need. If you want a higher level interface, consider running Hive over the native data files. Hive will let you run arbitrary friendly SQL-like queries over your raw data files. Or, since you only have 150MB/day, you could just batch load it into MySQL readonly compressed tables.
If for some reason you need the complexity of an interactive system, HBase or Cassandra or might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150MB/day is so little data that you probably don't need the complexity.

We're using Hadoop/HBase, and I've looked at Cassandra, and they generally use the row key as the means to retrieve data the fastest, although of course (in HBase at least) you can still have it apply filters on the column data, or do it client side. For example, in HBase, you can say "give me all rows starting from key1 up to, but not including, key2".
So if you design your keys properly, you could get everything for 1 user, or 1 host, or 1 user on 1 host, or things like that. But, it takes a properly designed key. If most of your queries need to be run with a timestamp, you could include that as part of the key, for example.
How often do you need to query the data/write the data? If you expect to run your reports and it's fine if it takes 10, 15, or more minutes (potentially), but you do a lot of small writes, then HBase w/Hadoop doing MapReduce (or using Hive or Pig as higher level query languages) would work very well.

If your JSON data has variable fields, then a schema-less model like Cassandra could suit your needs very well. I'd expand the data into columns rather then storing it in binary format, that will make it easier to query. With the given rate of data, it would take you 20 years to fill a 1 TB disk, so I wouldn't worry about compression.
For the example you gave, you could create two column families, Audit and Journal. The row keys would be TimeUUIDs (i.e. timestamp + MAC address to turn them into unique keys). Then the audit row you gave would have four columns, host:'foo', username:'bar', task:'foo', and result:0. Other rows could have different columns.
A range scan over the row keys would allow you to query efficiently over time periods (assuming you use ByteOrderedPartitioner). You could then use secondary indexes to query on users and terminals.

Related

Big quantity of data with MySql

i have a question about Mass storage. Actually, i'm working with 5 sensors which sends a lot of datas with a different frequency for each one and i'm using MySQL DATABASE.
so here is my questions:
1) is MySQL the perfect solution.
2) if not, is there a solution to store this big quantity of data in a data base?
3) I'm using Threads in this and i'm using mutexs also, i'm afraid if this can cause problems, Actually,it seems to be.
i hope i will have an answer to this question.
MySql is good solution for OLTP scenarios where you are storing transactions to serve web or mobile apps. But it does not scale well (despite of cluster abilities).
There are many options out there based on what is important to you:
File System: You can device your own write-ahead-log solution to solve multi-threading problems and achieve "eventual consistency". That way you don't have to lock data for one thread at a time. You can use schema-full files like CSV, Avro or Parquet. Also you can use S3 or WSB for cloud based block storage. Or HDFS for just block and replicated storage.
NoSql: You can store each entry as document in NoSql Document stores. If you want to keep data in memory for faster read, explore Memcached or Redis. If you want to perform searches on data, use Solr or ElasticSearch. MongoDB is popular but it has scalability issues similar to MySql, instead I would chose Cassandra or HBase if you need more scalability. With some of NoSql stores, you might have to parse your "documents" at read time which may impact analytics performance.
RDBMS: As MySql is not scalable enough, you might explore Teradata and Oracle. Latest version of Oracle offers petabyte query capabilities and in-memory caching.
Using a database can add extra computation overhead if you have a "lot of data". Another question is what you do with the data? If you only stack them, a map/vector can be enough.
The first step is maybe to use map/vector that you can serialize to a file when needed. Second you can add the database if you wish.
About mutex if you share some code with different thread and if (in this code) you work on the same data at the same time, then you need them. Otherwise remove them. BTW if you can separate read and write operations then you don't need mutex/semaphore mechanism.
You can store data anywhere, but the data storage structure selection would depends on the use cases (the things, you want to do with the data).
It could be HDFS files, RDBMS, NoSQL DB, etc.
For example your common could be:
1. to save the sensor data very quickly.
2. get the sensor data on the definite date.
Then, you can use MongoDB or Cassandra.
If you want to get deep analytics (to get monthly average sensor data), you definitely should think about another solutions.
As for MySQL, it could also be used for some reasonable big data storage,
as it supports sharding. It fits some scenarios well, some not.
But I repeat, all would depend on use cases, i.e. the things you want to do with data.
So you could provide question with more details (define desired use-cases), or ask again.
There are several Questions that discuss "lots of data" and [mysql]. They generally say "yes, but it depends on what you will do with it".
Some general statements (YMMV):
a million rows -- no problem.
a billion rows or a terabyte of data -- You will run into problems, but they are no insurmountable.
100 inserts per second on spinning disk -- probably no problem
1000 rows/second inserted can be done; troubles are surmountable
creating "reports" from huge tables is problematical until you employ Summary Tables.
Two threads storing into the same table at the "same" time? Every RDBMS (MySQL included) solves that problem before the first release. The Mutexes (or whatever) are built into the code; you don't have to worry.
"Real time" -- If you are inserting 100 sensor values per second and comparing each value to one other value: No problem. Comparing to a million other values: big problem with any system.
"5 sensors" -- Read each hour? Yawn. Each minute? Yawn. Each second? Probably still Yawn. We need more concrete numbers to help you!

Using Redis to cache data that is in use in a Real TIme Single Page App

I've got a web application, it has the normal feature, user settings etc these are all stored in MYSQL with the user etc.....
A particular part of the application a is a table of data for the user to edit.
I would like to make this table real time, across multiple users. Ie multiple users can open the page edit the data and see changes in real time done by other users editing the table.
My thinking is to cache the data for the table in Redis, then preform all the actions in redis like keeping all the clients up to date.
Once all the connection have closed for a particular table save the data back to mysql for persistence, I know Redis can be used as a persistent NoSQL database but as RAM is limited and all my other data is stored in MYSQL, mysql seems a better option.
Is this a correct use case for redis? Is my thinking correct?
It depends on the scalability. The number of records you are going to deal with and the structure you are going to use for saving it.
I will discuss about pros and crons of using redis. The decision is up to you.
Advantages of using redis:
1) It can handle heavy writes and reads in comparison with MYSQL
2) It has flexible structures (hashmap, sorted set etc) which can
localise your writes instead of blocking the whole table.
3) Read queries will be much faster as it is served from cache.
Disadvantages of using redis:
1) Maintaining transactions. What happens if both users try to access a
particular cell at a time? Do you have a right data structure in redis to
handle this case?
2) What if the data is huge? It will exceed the memory limit.
3) What happens if there is a outage?
4) If you plan for persistence of redis. Say using RDB or AOF. Will you
handle those 5-10 seconds of downtime?
Things to be focussed:
1) How much data you are going to deal with? Assume for a table of 10000 rows wit 10 columns in redis takes 1 GB of memory (Just an assumption actual memory will be very much less). If your redis is 10GB cluster then you can handle only 10 such tables. Do a math of about how many rows * column * live tables you are going to work with and the memory it consumes.
2) Redis uses compression for data within a range http://redis.io/topics/memory-optimization. Let us say you decide to save the table with a hashmap, you have two options, for each column you can have a hashmap or for each row you can have a hashmap. Second option will be the optimal one. because storing 1000 (hashmaps -> rows) * 20 (records in each hash map -> columns) will take 10 time less memory than storing in the other way. Also in this way if a cell is changed you can localize in hashmap of within 20 values.
3) Loading the data back in your MYSQL. how often will this going to happen? If your work load is high then MYSQL begins to perform worse for other operations.
4) How are you going to deal with multiple clients on notifying the changes? Will you load the whole table or the part which is changes? Loading the changed part will be the optimal one. In this case, where will you maintain the list of cells which have been altered?
Evaluate your system with these questions and you will find whether it is feasible or not.

MySQL Cluster and big blobs

I decided to use a MySQL Cluster for a bigger project of mine. Beside storing documents in a simple table scheme with only three indexes, a need to store information in the size of 1MB to 50MB arise. Those informations will be serialized custom tables being aggregats of data feeds.
How will be those information be stored and how many nodes will those information hit? I understand that with a replication factor of three those information will be written three times and I understand that there are coordinator nodes (named differently) so I ask myself what will be the impact storing those information?
Is it right that I understand that for a read a cluster will send those blobs to three servers (one requested the information, one coordinator and one data server) and for a write it is 5 (1+1+3)?
Generally speaking MySQL only supports NoOfReplicas=2 right now, using 3 or 4 is generally not supported and not very well tested, this is noted here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-noofreplicas
"The maximum possible value is 4; currently, only the values 1 and 2 are actually supported."
As also described in the above URL, the data is stored with the same number of replicas as this setting. So with NoOfReplicas=2, you get 2 copies. These are stored on the ndbd (or ndbmtd) nodes, the management nodes (ndb_mgmd) act as co-ordinators and the source of configuration, they do not store any data and neither does the mysqld node.
If you had 4 data nodes, you would have your entire dataset split in half and then each half is stored on 2 of the 4 data nodes. If you had 8 data nodes, your entire data set would be split into four parts and then each part stored on 2 of the 8 data nodes.
This process is sometimes known as "Partitioning". When a query runs, the data is split up and sent to each partition which processes it locally as much as possible (for example by removing non-matching rows using indexes, this is called engine condition pushdown, see http://dev.mysql.com/doc/refman/5.6/en/condition-pushdown-optimization.html) and then it is aggregated in mysqld for final processing (may include calculations, joins, sorting, etc) and return to the client. The ndb_mgmd nodes do not get involved in the actual data processing in any way.
Data is by default partitioned by the PRIMARY KEY, but you can change this to partition by other columns. Some people use this to ensure that a given query is only processed on a single data node much of the time, for example by partitioning a table to ensure all rows for the same customer are on a single data node rather than spread across them. This may be better, or worse, depending on what you are trying to do.
You can read more about data partitioning and replication here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-nodes-groups.html
Note that MySQL Cluster is really not ideal for storing such large data, in any case you will likely need to tune some settings and try hard to keep your transactions small. There are some specific extra limitations/implications of using BLOB which you can find discussed here:
http://dev.mysql.com/doc/mysql-cluster-excerpt/5.6/en/mysql-cluster-limitations-transactions.html
I would run comprehensive tests to ensure it is performing well under high load if you go ahead and ensure you setup good monitoring and test your failure scenarios.
Lastly, I would also strongly recommend getting pre-sales support and a support contract from Oracle, as MySQL Cluster is quite a complicated product and needs to be configured and used correctly to get the best out of it. In the interest of disclosure, I work for Oracle in MySQL Support -- so you can take that recommendation as either biased or very well informed.

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So except if you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent to aggregate statistics in real time and store the result of these caclulations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. Btree based stores supporting sharding (MongoDB for instance) are probably more convenient than Redis to store large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so despite it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra at least is that you don't have consistency for a few seconds sometimes if you have a big cluster, so you need either to force it, slowing it down, or you accept that the very very latest print sometimes will be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
Less familiar with Hbase but it claims to be more consistent (there will be a cost elsewhere - CAP theorem), but it's much more of a commitment to setup the Hbase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
>
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all data at all times we've had a "compression" routine that takes the raw data and calculates min. max and average for a number of data-points, store these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term since I cant come up with a better one, I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just use ovak_result as a "temporary storage"? ovak_result would be kept minimal which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.