Couchbase replication across data centers

Let's say I'm setting up Couchbase with a bucket replicated across N data centers.
Can I write documents into the bucket in each data center and rely on replication, so that the bucket in each data center ends up containing all documents from all data centers?

Yes. As long as you create a replication to each data center, this will work. For example, you could set up bi-directional replication as follows:
(Data Center A) <-- rep 1 --> (Data Center B) <-- rep 2 --> (Data Center C)
or any other combination of replications, as long as there is a path to each data center through those replications.
See http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-admin-tasks-xdcr.html for more information.
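The condition that there be a path to each data center is just graph connectivity. As a rough illustration (a plain-Python sketch of my own, not any Couchbase SDK call; the function name and data-center labels are made up), you could check a proposed set of bi-directional replications like this:

    # Treat each bi-directional XDCR replication as an undirected edge and
    # check that the replication topology forms one connected component.
    def fully_replicated(data_centers, replications):
        adjacency = {dc: set() for dc in data_centers}
        for a, b in replications:
            adjacency[a].add(b)
            adjacency[b].add(a)
        start = next(iter(data_centers))
        seen, stack = {start}, [start]
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen == set(data_centers)

    # The chain from the answer: (A) <-> (B) <-> (C)
    print(fully_replicated(["A", "B", "C"], [("A", "B"), ("B", "C")]))  # True
    print(fully_replicated(["A", "B", "C"], [("A", "B")]))              # False: C unreachable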

Related

MySQL Cluster and big blobs

I decided to use MySQL Cluster for a bigger project of mine. Besides storing documents in a simple table schema with only three indexes, I need to store pieces of information between 1 MB and 50 MB in size. These will be serialized custom tables that are aggregates of data feeds.
How will this information be stored, and how many nodes will it hit? I understand that with a replication factor of three it will be written three times, and I understand that there are coordinator nodes (named differently), so I ask myself: what will be the impact of storing this information?
Am I right in understanding that a read will involve three servers (the one that requested the information, one coordinator, and one data node), and a write will involve five (1+1+3)?
Generally speaking, MySQL Cluster only supports NoOfReplicas=2 right now; using 3 or 4 is generally not supported and not very well tested. This is noted here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-noofreplicas
"The maximum possible value is 4; currently, only the values 1 and 2 are actually supported."
As also described at the above URL, the data is stored with the same number of replicas as this setting. So with NoOfReplicas=2, you get 2 copies. These are stored on the ndbd (or ndbmtd) nodes. The management nodes (ndb_mgmd) act as coordinators and the source of configuration; they do not store any data, and neither does the mysqld node.
If you had 4 data nodes, your entire dataset would be split in half, with each half stored on 2 of the 4 data nodes. If you had 8 data nodes, your entire dataset would be split into four parts, with each part stored on 2 of the 8 data nodes.
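To make the arithmetic concrete, here is a trivial sketch (an illustration only, not NDB code):

    # Each "node group" holds one partition of the data, replicated across
    # NoOfReplicas nodes, so: node groups = data nodes / NoOfReplicas.
    def node_groups(data_nodes, no_of_replicas=2):
        assert data_nodes % no_of_replicas == 0, "nodes must divide evenly"
        return data_nodes // no_of_replicas

    print(node_groups(4))  # 2 -> dataset split in half, each half on 2 of 4 nodes
    print(node_groups(8))  # 4 -> dataset split in quarters, each on 2 of 8 nodes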
This process is sometimes known as "partitioning". When a query runs, it is split up and sent to each partition, which processes it locally as much as possible (for example, by removing non-matching rows using indexes; this is called engine condition pushdown, see http://dev.mysql.com/doc/refman/5.6/en/condition-pushdown-optimization.html). The results are then aggregated in mysqld for final processing (which may include calculations, joins, sorting, etc.) and returned to the client. The ndb_mgmd nodes do not get involved in the actual data processing in any way.
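Conceptually, that flow looks something like this toy sketch, with plain Python lists standing in for the data node partitions (none of this is actual NDB code):

    # Each partition filters its own rows ("engine condition pushdown"), so
    # only matching rows travel to the central mysqld step for sorting etc.
    partitions = [
        [{"id": 1, "size": 5}, {"id": 2, "size": 60}],   # data node group 1
        [{"id": 3, "size": 70}, {"id": 4, "size": 2}],   # data node group 2
    ]

    def run_query(predicate):
        matching = [row for part in partitions for row in part if predicate(row)]
        return sorted(matching, key=lambda row: row["size"])  # "final processing"

    print(run_query(lambda row: row["size"] > 50))  # only 2 of 4 rows leave the partitions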
Data is partitioned by the PRIMARY KEY by default, but you can change this to partition by other columns. Some people use this to ensure that a given query is processed on a single data node much of the time, for example by partitioning a table so that all rows for the same customer sit on a single data node rather than being spread across them. This may be better, or worse, depending on what you are trying to do.
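As a toy illustration of partitioning by a chosen column (a sketch of the idea only; NDB's real hashing and distribution scheme differs):

    # Hashing the partitioning column decides which node group stores a row;
    # partitioning on customer_id keeps one customer's rows together.
    NUM_NODE_GROUPS = 2  # e.g. 4 data nodes with NoOfReplicas=2

    def node_group_for(partition_key):
        return hash(partition_key) % NUM_NODE_GROUPS

    orders = [(101, "widget"), (101, "gadget"), (202, "sprocket")]
    for customer_id, item in orders:
        print(customer_id, item, "-> node group", node_group_for(customer_id))
    # All of customer 101's rows land in the same node group.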
You can read more about data partitioning and replication here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-nodes-groups.html
Note that MySQL Cluster is really not ideal for storing such large data; in any case you will likely need to tune some settings and try hard to keep your transactions small. There are some specific extra limitations/implications of using BLOBs, which you can find discussed here:
http://dev.mysql.com/doc/mysql-cluster-excerpt/5.6/en/mysql-cluster-limitations-transactions.html
If you go ahead, I would run comprehensive tests to ensure it performs well under high load, and make sure you set up good monitoring and test your failure scenarios.
Lastly, I would also strongly recommend getting pre-sales support and a support contract from Oracle, as MySQL Cluster is quite a complicated product and needs to be configured and used correctly to get the best out of it. In the interest of disclosure, I work for Oracle in MySQL Support -- so you can take that recommendation as either biased or very well informed.

MySQL or Cassandra

First of all, I apologize for my English.
I have to store the following data structure:
n nodes linked by m edges (a graph).
Every node has a number of attributes (the same number and type for each node).
These nodes should also be linked to another set of objects, each composed of a set of metadata (every object could have a different number of metadata entries) and a BLOB.
The number of nodes is roughly 1,000,000 and the number of edges is 800,000,000.
The point is: MySQL or Cassandra?
Let me know if you need more details!
Thank you in advance.
There are graph databases designed specifically to store graph data; Neo4j is one example.
Generally: if it has to be one of those two, try both and measure the performance. You can do this easily with both an SQL and a NoSQL approach. Also, you didn't mention what queries you'll be running on the dataset, which greatly impacts the decision.
That said, nowadays I'd try to go for Cassandra whenever I have the possibility to do so, since multi-node replication (and the resulting fault tolerance) doesn't really work well with MySQL.

Two types of data, so two types of databases?

For a social-network site, I need to propose a database. The application is written in Java and will initially be hosted on one or more VPSs.
Broadly classified, there are two types of data to be stored on the backend:
1. dynamic lists, which are:
- frequently appended to
- frequently read
- sometimes reduced
2. a fixed set of data keyed by a primary key (sometimes modified).
"For serving any page, I need to have access to both kind of data!"
As demanded by every other SN site, we need to consider for easy scaling in the future, but in addition to that our team & resources are also very very limited. We would like to start with a 1 or 2 medium sized VPS(s) & add more servers as data & load grows.
Personally I usually prefer something that is used by a large community, so ofcourse MySQL is big option but it doesn't fit our entire needs. It could be used for 2nd kind of data(among the list above) ie for storing fixed set of columns/data but not ideal for storing dynamic lists(ie 1st kind).
So should I use a 2nd database just to fit in only that type of data (two database each containing only data best suited for them)? (Some suggested Cassandra to store the 2nd kind of data.)
What is the way to go ?
Use a traditional database when you need transactional integrity and you have a fixed set of relations to map.
Use a document database when you have multiple properties of objects to store in a flat structure, or where the schema (the properties of the objects) may change over time. This is one of the weaknesses of traditional database systems: changing schemas is possible but has lots of performance side-effects. In document databases, the properties of the objects being stored have little impact on the overall performance of the system; more practically, the information stored about objects (their properties or "columns") can be modified without having to worry about schemas.
Use a key value store for ephemeral data.
From what you have described, I don't see any use case that would require a relational database.
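To illustrate the schema-flexibility point, here is a toy sketch with plain Python dicts standing in for stored documents (my own example, not tied to any particular document database):

    # Two "user" documents can carry different properties with no schema change;
    # a relational table would need an ALTER TABLE (or NULL-heavy columns).
    user_a = {"id": 1, "name": "Alice", "friends": [2]}
    user_b = {"id": 2, "name": "Bob", "friends": [1], "hometown": "Oslo"}  # extra property

    for user in (user_a, user_b):
        print(user["name"], "->", sorted(user))  # each document lists only its own keys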

RRDTool and projects that use it (cacti etc) - HOWTO, storage, backup etc

I want to create an application similar to cacti.
I would like to store time-series data in a MySQL database (rotated on a schedule).
Where does cacti (nagios, zenoss) store polled data?
a) in a MySQL database
b) in a RRD database
c) both?
How does cacti (nagios, zenoss) make room for more data when it runs out of space?
How is a backup made (when there is no more space) without losing the already-inserted data?
The questions are in the form "How does X do Y?" but the more general issue is "How should I do Y?".
Cacti stores its data in an RRD, a "round robin database".
Old data is rotated out, hence the "round robin" moniker. Alex van den Bogaerdt's basic rrdtool tutorial has more details: http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html
This is one of those "it depends" answers. With RRDTool, the data gets averaged and aged out, so you don't run out of space in the RRD (see #2). Normally, you plan for the amount of data you want to store when you create the RRDs, but that can take some experience and tweaking.
As to how you should do this: it depends on what you want to do with the data. With RRDTool, you don't get back the exact data you put in (due to the averaging over time). The tutorial linked above should give you enough information to make that decision.
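For intuition, here is a toy sketch of that round-robin/averaging behaviour (plain Python mimicking the idea; real RRDs are created and configured with rrdtool itself):

    # A fixed-size archive overwrites its oldest slot; older history survives
    # only as averages in a coarser archive, so storage never grows but the
    # exact samples are lost.
    from collections import deque

    fine = deque(maxlen=6)     # the 6 most recent raw samples
    coarse = deque(maxlen=6)   # averaged history, one slot per 3 samples
    pending = []

    def store(sample):
        fine.append(sample)
        pending.append(sample)
        if len(pending) == 3:  # consolidation step, like an RRA average
            coarse.append(sum(pending) / len(pending))
            pending.clear()

    for value in range(12):
        store(value)

    print(list(fine))    # [6, 7, 8, 9, 10, 11] -- only recent raw data remains
    print(list(coarse))  # [1.0, 4.0, 7.0, 10.0] -- older data survives only as averages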

Modeling a directed graph with a special center node

I'm looking for opinions on how to model a directed graph that contains one special node.
Special node:
Cannot have any edges leading to it.
Cannot be removed.
Current design:
Tables: Nodes and Edges. Edges contains two columns, from_node_id and to_node_id, each referencing a record in the Nodes table.
Rather than storing the special node as the first record in the Nodes table, I decided not to keep a record for it at all, constructing it separately from any database queries. In the Edges table, NULL takes on a special meaning in the from_node_id column, referring to the center node.
My motivation for this design was that I wouldn't have to worry about protecting a center-node record from deletion/modification, or about it being referenced in the to_node_id column of the Edges table. It would also automatically prevent an edge from going from and to the same node. I realize there are drawbacks to this design, such as not being able to make (from_node_id, to_node_id) a composite primary key, and probably many more.
I'm currently leaning towards making the center node an actual record and creating checks for that node in the relevant database methods. What's the best way to go about this design?
I see some arguments against using NULL in this case.
If the nodes contain actual data, you would have to hard-code the central node's data in the application.
There will be trouble if the central node can be changed.
The usual meaning of NULL is that there is no value or that the value is unknown. Because of this, another person approaching the proposed design could find it unintuitive.
In other words, I would prefer to have a row in the database for the central node.
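For example, the "real record plus checks" design could look something like the following sketch, here using SQLite through Python's sqlite3 module (the CENTER_ID value and the trigger are my own assumptions, not from the question):

    # The central node is an ordinary Nodes row; constraints stop edges from
    # pointing at it or looping, and a trigger protects it from deletion.
    import sqlite3

    CENTER_ID = 1  # hypothetical fixed id for the central node

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(f"""
    CREATE TABLE Nodes (
        node_id INTEGER PRIMARY KEY
    );
    CREATE TABLE Edges (
        from_node_id INTEGER NOT NULL REFERENCES Nodes(node_id),
        to_node_id   INTEGER NOT NULL REFERENCES Nodes(node_id),
        PRIMARY KEY (from_node_id, to_node_id),  -- now possible: no NULLs
        CHECK (from_node_id <> to_node_id),      -- no self-loops
        CHECK (to_node_id <> {CENTER_ID})        -- no edges leading to the center
    );
    CREATE TRIGGER protect_center BEFORE DELETE ON Nodes
    WHEN OLD.node_id = {CENTER_ID}
    BEGIN
        SELECT RAISE(ABORT, 'cannot delete the central node');
    END;
    """)

    conn.execute("INSERT INTO Nodes VALUES (?)", (CENTER_ID,))
    conn.execute("INSERT INTO Nodes VALUES (2)")
    conn.execute("INSERT INTO Edges VALUES (?, 2)", (CENTER_ID,))  # edge out of the center: OK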