First of all, I apologize for my English.
I have to store the following data structure:
n nodes linked by m edges (a graph).
Every node has a number of attributes (the same number and type for each node).
These nodes should also be linked to another set of objects, each composed of a set of metadata (every object can have a different number of metadata entries) and a BLOB.
The number of nodes is roughly 1,000,000 and the number of edges is 800,000,000.
The point is: MySQL or Cassandra?
Let me know if you need more details!
Thank you in advance.
There are graph databases designed specifically to store graph data; Neo4j is one example.
Generally: if it has to be one of those two, try both and measure the performance. You can do this fairly easily with both an SQL and a NoSQL approach. You also didn't mention what queries you'll be running on the dataset, which greatly impacts the decision.
That said, nowadays I'd go for Cassandra whenever I have the possibility to do so, since multi-node replication (and the resulting fault tolerance) doesn't really work well with MySQL.
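For reference, a minimal sketch of how the graph could be laid out in each store (all table, keyspace and column names here are made up, not from the question): the SQL variant keeps edges in a single adjacency table, while the Cassandra variant makes each node's outgoing edges one wide partition.

```python
# Hypothetical schemas for a ~1M-node / 800M-edge graph; run via any MySQL client / cqlsh.

MYSQL_SCHEMA = """
CREATE TABLE nodes (
    node_id BIGINT PRIMARY KEY,
    attr_a  INT,                 -- same fixed attributes on every node
    attr_b  VARCHAR(64)
);
CREATE TABLE edges (
    src_id  BIGINT NOT NULL,
    dst_id  BIGINT NOT NULL,
    PRIMARY KEY (src_id, dst_id),   -- adjacency list: fast "neighbours of X"
    KEY (dst_id)                    -- reverse lookups
);
"""

CASSANDRA_SCHEMA = """
CREATE TABLE graph.edges (
    src_id bigint,
    dst_id bigint,
    PRIMARY KEY ((src_id), dst_id)  -- one wide partition per source node
);
"""
```

Whether either layout is good enough depends entirely on the queries you need to run, which is the point made above.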
I have a question about mass storage. I'm currently working with 5 sensors, each sending a lot of data at a different frequency, and I'm storing it in a MySQL database.
So here are my questions:
1) Is MySQL the right solution?
2) If not, is there a better way to store this large quantity of data in a database?
3) I'm using threads, and mutexes as well; I'm afraid this could cause problems. Actually, it seems to.
I hope I will get an answer to this question.
MySQL is a good solution for OLTP scenarios where you are storing transactions to serve web or mobile apps, but it does not scale well (despite its clustering abilities).
There are many options out there based on what is important to you:
File system: You can devise your own write-ahead-log solution to solve multi-threading problems and achieve "eventual consistency". That way you don't have to lock data for one thread at a time. You can use schema-full file formats like CSV, Avro or Parquet (see the sketch after this list). You can also use S3 or WASB (Azure Blob Storage) for cloud-based storage, or HDFS for plain replicated block storage.
NoSQL: You can store each entry as a document in a NoSQL document store. If you want to keep data in memory for faster reads, look at Memcached or Redis. If you want to perform searches on the data, use Solr or Elasticsearch. MongoDB is popular, but it has scalability issues similar to MySQL; I would choose Cassandra or HBase instead if you need more scalability. With some NoSQL stores you may have to parse your "documents" at read time, which can hurt analytics performance.
RDBMS: As MySQL is not scalable enough, you might explore Teradata or Oracle. The latest version of Oracle offers petabyte-scale query capabilities and in-memory caching.
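As a small illustration of the "schema-full files" idea, here is a hedged sketch that batches sensor readings and writes them as one Parquet file with pyarrow (the file name and field names are made up):

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq

# Accumulate readings in plain lists, then flush a batch as one immutable file.
batch = {
    "sensor_id": [1, 1, 2],
    "ts": [time.time()] * 3,
    "value": [20.1, 20.3, 7.8],
}

table = pa.Table.from_pydict(batch)
pq.write_table(table, "readings-000001.parquet")  # append-only: one new file per batch
```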
Using a database adds extra computational overhead if you have a lot of data. Another question is what you do with the data: if you only accumulate it, a map/vector may be enough.
The first step might be to use a map/vector that you serialize to a file when needed; you can add the database later if you still need it.
About mutexes: if you share code between threads and, in that code, you work on the same data at the same time, then you need them; otherwise remove them. By the way, if you can separate read and write operations, you don't need a mutex/semaphore mechanism at all.
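One common way to separate reads and writes, as suggested above, is to have each sensor thread push readings onto a thread-safe queue and let a single writer thread do all the storage work. A minimal sketch (the storage target and `read_value` callback are placeholders):

```python
import queue
import threading

q = queue.Queue()  # thread-safe: no explicit mutex needed in your own code

def sensor_worker(sensor_id, read_value):
    # Each sensor thread only produces; it never touches the storage layer.
    while True:
        q.put((sensor_id, read_value()))

def writer():
    # The single consumer owns the file/DB connection, so no locking is required.
    with open("readings.log", "a") as f:
        while True:
            sensor_id, value = q.get()
            f.write(f"{sensor_id},{value}\n")
            f.flush()

threading.Thread(target=writer, daemon=True).start()
```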
You can store data anywhere, but the choice of storage structure depends on the use cases (the things you want to do with the data).
It could be HDFS files, an RDBMS, a NoSQL DB, etc.
For example, your common use cases could be:
1. save the sensor data very quickly;
2. get the sensor data for a specific date.
Then you could use MongoDB or Cassandra.
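For those two use cases (fast writes, reads by date), a hedged Cassandra sketch could partition readings by sensor and day, so one day of one sensor's data is a single contiguous partition. Keyspace, table and column names below are made up, and the `sensors` keyspace is assumed to exist:

```python
from datetime import date, datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE TABLE IF NOT EXISTS sensors.readings (
        sensor_id int,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)   -- fast appends, reads by sensor + date
    )
""")
session.execute(
    "INSERT INTO sensors.readings (sensor_id, day, ts, value) VALUES (%s, %s, %s, %s)",
    (1, date(2016, 1, 1), datetime(2016, 1, 1, 12, 0), 20.4),
)
```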
If you want deep analytics (say, monthly average sensor values), you should definitely think about other solutions.
As for MySQL, it can also be used for reasonably big data storage, as it supports sharding. It fits some scenarios well, others not.
But I repeat, it all depends on the use cases, i.e. the things you want to do with the data.
So either add more details to the question (define the desired use cases), or ask again.
There are several Questions that discuss "lots of data" and [mysql]. They generally say "yes, but it depends on what you will do with it".
Some general statements (YMMV):
A million rows: no problem.
A billion rows or a terabyte of data: you will run into problems, but they are not insurmountable.
100 inserts per second on spinning disk: probably no problem.
1000 rows per second inserted can be done; the troubles are surmountable.
Creating "reports" from huge tables is problematic until you employ summary tables (see the sketch after this list).
Two threads storing into the same table at the "same" time? Every RDBMS (MySQL included) solved that problem before its first release. The mutexes (or whatever) are built into the code; you don't have to worry.
"Real time": if you are inserting 100 sensor values per second and comparing each value to one other value, no problem. Comparing to a million other values: big problem with any system.
"5 sensors": read each hour? Yawn. Each minute? Yawn. Each second? Probably still yawn. We need more concrete numbers to help you!
I decided to use MySQL Cluster for a bigger project of mine. Besides storing documents in a simple table schema with only three indexes, I also need to store pieces of information ranging from 1 MB to 50 MB in size. These will be serialized custom tables that are aggregates of data feeds.
How will this information be stored, and how many nodes will it hit? I understand that with a replication factor of three the information will be written three times, and I understand that there are coordinator nodes (named differently), so I ask myself: what is the impact of storing this information?
Am I right in thinking that a read involves three servers (the one that requested the information, one coordinator and one data node), and a write involves five (1+1+3)?
Generally speaking, MySQL Cluster only supports NoOfReplicas=2 right now; using 3 or 4 is generally not supported and not very well tested. This is noted here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-noofreplicas
"The maximum possible value is 4; currently, only the values 1 and 2 are actually supported."
As also described at the above URL, the data is stored with the same number of replicas as this setting, so with NoOfReplicas=2 you get 2 copies. These are stored on the ndbd (or ndbmtd) nodes; the management nodes (ndb_mgmd) act as coordinators and as the source of configuration. They do not store any data, and neither does the mysqld node.
If you had 4 data nodes, you would have your entire dataset split in half and then each half is stored on 2 of the 4 data nodes. If you had 8 data nodes, your entire data set would be split into four parts and then each part stored on 2 of the 8 data nodes.
This process is sometimes known as "partitioning". When a query runs, the work is split up and sent to each partition, which processes it locally as much as possible (for example by removing non-matching rows using indexes; this is called engine condition pushdown, see http://dev.mysql.com/doc/refman/5.6/en/condition-pushdown-optimization.html). The results are then aggregated in mysqld for final processing (which may include calculations, joins, sorting, etc.) and returned to the client. The ndb_mgmd nodes do not get involved in the actual data processing in any way.
Data is partitioned by the PRIMARY KEY by default, but you can change this to partition by other columns. Some people use this to ensure that a given query is processed on a single data node most of the time, for example by partitioning a table so that all rows for the same customer land on a single data node rather than being spread across them. This may be better, or worse, depending on what you are trying to do.
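A hedged example of that kind of partitioning (NDB only supports PARTITION BY KEY / LINEAR KEY, and the chosen column must be part of the primary key; the table and column names here are made up):

```python
# Plain DDL string; execute it through a mysqld node connected to the cluster.
ORDERS_DDL = """
CREATE TABLE orders (
    customer_id BIGINT NOT NULL,
    order_id    BIGINT NOT NULL,
    amount      DECIMAL(10,2),
    PRIMARY KEY (customer_id, order_id)
) ENGINE=NDBCLUSTER
  PARTITION BY KEY (customer_id);  -- all rows for one customer land in one partition
"""
```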
You can read more about data partitioning and replication here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-nodes-groups.html
Note that MySQL Cluster is really not ideal for storing such large data, in any case you will likely need to tune some settings and try hard to keep your transactions small. There are some specific extra limitations/implications of using BLOB which you can find discussed here:
http://dev.mysql.com/doc/mysql-cluster-excerpt/5.6/en/mysql-cluster-limitations-transactions.html
If you go ahead, I would run comprehensive tests to ensure it performs well under high load, and make sure you set up good monitoring and test your failure scenarios.
Lastly, I would also strongly recommend getting pre-sales support and a support contract from Oracle, as MySQL Cluster is quite a complicated product and needs to be configured and used correctly to get the best out of it. In the interest of disclosure, I work for Oracle in MySQL Support -- so you can take that recommendation as either biased or very well informed.
I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
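If a sharded Redis is viable, the simplest layout is one sorted set per symbol, scored by timestamp, so a range query by time is a single ZRANGEBYSCORE. A hedged sketch with redis-py (the key layout and symbol are made up):

```python
import time
import redis  # pip install redis

r = redis.Redis()

def record_quote(symbol, price, ts=None):
    ts = ts or time.time()
    # Sorted-set members must be unique, so embed the timestamp alongside the price.
    r.zadd(f"quotes:{symbol}", {f"{ts}:{price}": ts})

def quotes_between(symbol, t0, t1):
    members = r.zrangebyscore(f"quotes:{symbol}", t0, t1)
    return [m.decode().split(":", 1) for m in members]  # [(ts, price), ...] as strings

record_quote("AAPL", 98.25)
print(quotes_between("AAPL", 0, time.time()))
```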
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the result of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. B-tree based stores supporting sharding (MongoDB for instance) are probably more convenient than Redis for storing large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files and is space efficient. It can be used with partitioning, so although it does not index the data, it can be efficient at retrieving a subset of it if the partition granularity is chosen carefully.
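A hedged sketch of what that could look like: a quotes table on the ARCHIVE engine, partitioned by month so that a time-bounded query only has to open a few partitions (names and ranges are purely illustrative):

```python
# Plain DDL string; ARCHIVE supports only INSERT/SELECT and no secondary indexes,
# so partition pruning does most of the work here.
QUOTES_DDL = """
CREATE TABLE quotes_archive (
    symbol CHAR(8) NOT NULL,
    ts     DATETIME NOT NULL,
    price  DECIMAL(12,4) NOT NULL
) ENGINE=ARCHIVE
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p201301 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p201302 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
"""
```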
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is that along one of your query dimensions (usually by ticker) you're accessing disk (SSD or spinning) contiguously. You're not having to hit indexes millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar stores, without having to do any client-side shenanigans to merge blobs together.
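A hedged CQL sketch of that layout: one partition per ticker, rows clustered by timestamp, so reading a symbol's history is one contiguous scan (keyspace and column names are made up):

```python
TICKS_DDL = """
CREATE TABLE market.ticks (
    symbol text,
    ts     timestamp,
    price  decimal,
    PRIMARY KEY ((symbol), ts)
) WITH CLUSTERING ORDER BY (ts DESC);  -- latest prints first
"""

# A typical range query for one ticker hits a single partition.
LAST_HOUR = (
    "SELECT ts, price FROM market.ticks "
    "WHERE symbol = 'AAPL' AND ts > '2013-06-01 12:00:00';"
)
```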
My experience with Cassandra is that for the time series use case it's 10x faster than MongoDB, which is itself already much faster than most relational databases, and as the data size grows its advantage over the others grows too. That's true even on a single machine. This is where you should start.
The only negative on Cassandra, at least, is that with a big cluster you sometimes don't have consistency for a few seconds, so you either need to force it (slowing it down) or accept that the very latest print will occasionally be a few seconds old. On a single machine there will be no consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere; see the CAP theorem); it's also much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (and also removing it regularly), check out MySQL partitioning. See, for example, the answer I wrote to "What is the best way to delete old rows from MySQL on a rolling basis?":
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
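A hedged illustration of the pruning pattern described in the quote, assuming a `quotes` table that is already RANGE-partitioned by month on TO_DAYS(ts) (table and partition names are made up):

```python
# Plain SQL strings; run with any MySQL client.

# Carve the next month out of the catch-all partition before its data arrives.
ADD_NEXT_MONTH = """
ALTER TABLE quotes REORGANIZE PARTITION pmax INTO (
    PARTITION p201307 VALUES LESS THAN (TO_DAYS('2013-08-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);
"""

# Dropping a partition is a metadata operation, far cheaper than DELETE ... WHERE.
DROP_OLD_MONTH = "ALTER TABLE quotes DROP PARTITION p201301;"
```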
I want to know which data structure (AVL, B-tree, etc.) is used in most popular relational databases, and also in what way that data structure is superior to other comparable data structures. If possible, a small comparison would help me a lot! Thanks in advance!
It's usually a B-tree or a variant thereof, primarily because it packs many keys into each block, unlike binary trees such as AVL.
A node of a B-tree has a fixed maximum size and holds multiple keys and multiple pointers to child nodes, meaning fewer blocks need to be retrieved from disk to look up a value (compared to a binary tree).
The Wikipedia article on B+ trees has a good introduction from the angle of its application to databases.
For SQL Server, there is background info here.
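A toy sketch of the lookup idea (not a production structure, and real engines store nodes as fixed-size disk pages rather than Python objects): each node holds a sorted block of keys plus child pointers, so one block read narrows the search by a large factor instead of the factor of two a binary tree gives you.

```python
import bisect

class BTreeNode:
    def __init__(self, keys, values, children=None):
        self.keys = keys          # sorted; one disk block holds many of these
        self.values = values
        self.children = children  # None for leaf nodes

def search(node, key):
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    if node.children is None:     # leaf reached without a match
        return None
    return search(node.children[i], key)  # descend: one more block read

leaf1 = BTreeNode([1, 3], ["a", "c"])
leaf2 = BTreeNode([7, 9], ["g", "i"])
root = BTreeNode([5], ["e"], [leaf1, leaf2])
print(search(root, 9))  # -> "i"
```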
I would choose the B+ tree because it is appropriate for efficient insertion, deletion and range queries; but if the database has not changed since it was created, then a simple linear index is all that is needed.
I'm planning to write a program in Ruby to analyse some data which has come back from an online questionnaire. There are hundreds of thousands of responses, and each respondent answers about 200 questions. Each question is multiple-choice, so there are a fixed number of possible responses to each.
The intention is to use a piece of demographic data given by each respondent to train a system which can then guess that same piece of demographic data (age, for example) from a respondent who answers the same questionnaire, but doesn't specify the demographic data.
So I plan to use a vector (in the mathematical sense, not in the data structure sense) to represent the answers for a given respondent. This means each vector will be large (over 200 elements), and the total data set will be huge. I plan to store the data in a MySQL database.
So. 2 questions:
How should I store this in the database? One row per response to a single question, or one row per respondent? Or something else?
I'm planning to use something like the k-nearest neighbour algorithm, or a simple machine learning algorithm like a naive bayesian classifier to learn to classify new responses. Should I manipulate the data purely through SQL or should I load it into memory and store it in some kind of vast array?
First thing that comes to mind: storing it in memory can be perfectly reasonable for processing purposes. Let's say you reserve one byte for each answer; with a million respondents and 200 questions you have a 200 MB array. Not small, but definitely not memory-exhausting on a modern desktop, even with a 32-bit OS.
As for the database, I think you should have three tables: one for the respondents with the demographic data, one for the questions, and, since you have an n:m relation between these tables, a third one with the respondent ID, the question ID and the answer code.
If you don't need additional data for the questions (like the question text or something), you can even optimize away the questions table.
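A hedged sketch of that three-table layout (column names are made up):

```python
SCHEMA = """
CREATE TABLE respondents (
    respondent_id INT PRIMARY KEY,
    age           INT,              -- the demographic attribute(s) to predict
    gender        CHAR(1)
);
CREATE TABLE questions (
    question_id   INT PRIMARY KEY,
    text          VARCHAR(500)
);
CREATE TABLE answers (              -- the n:m link table
    respondent_id INT NOT NULL,
    question_id   INT NOT NULL,
    answer_code   TINYINT NOT NULL, -- index of the chosen option
    PRIMARY KEY (respondent_id, question_id)
);
"""
```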
Use an array of arrays, in memory. I just created a 500,000 x 200 array and it required about 500 MB of RAM. That's easily manageable on a 2 GB machine, and many, many orders of magnitude faster than going through SQL.
Personally, I wouldn't bother putting the data in MySQL at all. Just Marshal it in and out, and/or use JSON or CSV.
If you definitely need database storage, and the comments elsewhere about alternatives are worth considering, then I'd advise against storing 200-odd responses in 200-odd rows: you don't seem to have any obvious need for the flexibility that such a design would give and performance across hundreds of thousands of respondents is going to be dire.
Using an RDBMS gives you the ability to store very large amounts of data, access it in a variety of multi-dimensional ways and extend the structure of your data ad hoc over time. But what you gain in flexibility over a flat-file (or Marshalled, or other) option you often lose in performance. I have to confess to reaching for third normal form far too early myself. I guess the questions are: how much flexibility in querying do you expect to need, and how much change do you think your data is likely to undergo? If you think you're at the low end of both, consider leaving SQL on the shelf. If you abstract your data access into a separate layer, changing it later should be cheap. Just a thought...
I'd expect you can encode an individual's responses in such a way that they can easily be used in code, and it's unlikely to take more than 200 characters, less if you use some sort of packing or bit-mapping. I rather like the idea of bit-mapping, come to think of it: it makes simple comparison using something like Hamming distance an absolute breeze.
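A hedged sketch of the bit-mapping idea, assuming each answer is a small integer (at most 15 options per question, so 4 bits each): pack one respondent's answers into a single integer, then count the positions where two respondents differ, a Hamming-style distance over answer slots.

```python
def pack(answers, bits_per_answer=4):
    """Pack a list of small answer codes into one integer."""
    packed = 0
    for a in answers:
        packed = (packed << bits_per_answer) | a
    return packed

def answer_distance(p1, p2, n_questions, bits_per_answer=4):
    """Number of questions on which two packed response vectors differ."""
    mask = (1 << bits_per_answer) - 1
    diff = 0
    for _ in range(n_questions):
        if (p1 & mask) != (p2 & mask):
            diff += 1
        p1 >>= bits_per_answer
        p2 >>= bits_per_answer
    return diff

a = pack([1, 2, 3, 0])
b = pack([1, 2, 2, 0])
print(answer_distance(a, b, 4))  # -> 1
```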
I'm not a great database person, so I'll just answer #2:
If you'd really like to save memory (or foresee a situation where there will be a lot more data), you could take the best of both worlds: use Ruby essentially as a data-mining tool. Have it pull some of the data from the DB, then write the results back to the DB (probably into a different table or database altogether). This has the benefit of only using as much memory as you want it to.
Don't forget that Ruby is a dynamic object language; as such, a simple integer will probably take up more space than a simple int in C. It needs additional space to record whether it has been 'garnished' with any additional information, methods, etc.