I have read in a lot of online sources that one of the advantages of graph databases is a flexible schema, but I haven't found how exactly that is achieved.
Wikipedia says 'Graphs are flexible, meaning it allows the user to insert new data into the existing graph without loss of application functionality.' But that is something we can do in a relational database also, at least to an extent.
Can someone shed more light on this?
Thanks.
Edit: giving an example to make it clearer:
Take for example a User table:
FirstName|LastName|email|Facebook|Twitter|Linkedin|
Now, some users might have Facebook but not Twitter or Linkedin, or vice versa. Maybe they have multiple email IDs? How do you represent that?
Graph DB:
Vertices for User, FB_Link, Twitter_Link, Email.
Edges from User to FB, to Twitter, to Linkedin, and to each Email vertex (x 2), etc.
Json/DocumentDB:
{
ID:
FirstName:
LastName:
FB:
}
{
ID:
FirstName:
LastName:
Twitter:
Linkedin:
}
Notice that the documents can have different attributes.
Am I correct in the above interpretation of schema flexibility? Is there more to it?
The Wikipedia article is oversimplifying things with the statement:
allows the user to insert new data into the existing graph without loss of application functionality
Any database allows you to insert data without losing application functionality. Rather, let's focus on the flexible-schema side of graph databases, because that is where the difference lies.
A Brief Side Note
SQL is built on the relational model, which enforces strong consistency checks between data records; it does this by enforcing locks on structural changes. Graph databases are built on the property graph model, which enforces no such relational constraints, and that means no locks (in most cases). It only requires key-value pairs on constructs called vertices, connected together via edges.
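To make the vertex/edge/key-value idea concrete, here is a minimal, purely illustrative sketch in Python (plain dictionaries, no particular graph database), reusing the user example from the question; all names and values are invented. Two users live in the same graph without sharing a schema:

    # A tiny in-memory property graph: vertices and edges are just key-value maps.
    vertices = {
        1: {"label": "User",    "FirstName": "Ada",  "LastName": "Lovelace"},
        2: {"label": "Email",   "address": "ada@example.com"},
        3: {"label": "Twitter", "handle": "@ada"},
        4: {"label": "User",    "FirstName": "Alan", "LastName": "Turing"},
        5: {"label": "FB_Link", "url": "facebook.com/alan"},
    }

    edges = [
        {"from": 1, "to": 2, "label": "HAS_EMAIL"},
        {"from": 1, "to": 3, "label": "HAS_TWITTER"},
        {"from": 4, "to": 5, "label": "HAS_FACEBOOK"},  # Alan has no Twitter or Email vertex at all
    ]

    def neighbours(vertex_id):
        # Follow outgoing edges; no schema dictates which edge labels must exist.
        return [(e["label"], vertices[e["to"]]) for e in edges if e["from"] == vertex_id]

    print(neighbours(1))  # Ada: email + twitter
    print(neighbours(4))  # Alan: facebook only

Nothing forces every User vertex to have the same properties or the same outgoing edge types, which is exactly the flexibility the question asks about.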
With that bit of context covered, let's talk about your main question:
How Is a Flexible Schema Achieved
Because property graphs formally have no constraint rules to satisfy in order to function, you can impose a graph representation on pretty much any data source; technically even on a SQL table with no indices, if you so chose.
How this is done in practice, though, varies from graph DB to graph DB; the field lacks standardisation at the moment. For example,
JanusGraph runs on different NoSQL backends, such as a wide-column store or a document store.
OrientDB uses a json document store.
RedisGraph uses an in-memory key-value store.
Neo4j uses its own data model, which I am not familiar with.
Note how all of the above use NoSQL stores as the backbone. This is how a flexible schema is achieved: these graph databases simply store the data outside of relational DBs/tables, where there are "no" rules; document stores and JSON are a good example.
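As a purely hypothetical illustration of that idea (not how JanusGraph, OrientDB, or any specific engine actually serialises its data), a vertex might be persisted as one JSON document in the backing store, with its properties and an adjacency list embedded:

    import json

    # Hypothetical layout: one document per vertex, keyed by an id the engine controls.
    vertex_doc = {
        "_id": "user:1",
        "label": "User",
        "properties": {"FirstName": "Ada", "LastName": "Lovelace"},
        "out_edges": [
            {"label": "HAS_EMAIL",   "to": "email:2"},
            {"label": "HAS_TWITTER", "to": "twitter:3"},
        ],
    }

    # Because the backing store only sees an opaque JSON blob under a key, adding a
    # new property or a new edge type later never requires a schema migration.
    print(json.dumps(vertex_doc, indent=2))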
If you are curious about an example implementation of a property graph model, you can check out Neo4j's model, JanusGraph's docs, or a generic comparison of their models.
Related
I am trying to structure a nosql database for the first time. I have a user table which contains: name and email address. Now each user can have more than 1 device.
Each device basically has an array of readings.
Here is what my current structure looks like:
How can I improve this structure?
PS: I am using AngularJS with AngularFire.
In relational databases, there is the concept of normal forms, and with it a somewhat objective measure of whether a data model is normalized.
In NoSQL databases you often end up modeling the data for the way your app consumes it. Hence there is little concept of what constitutes a good data model, without also considering the use-cases of your app.
That said: the Firebase documentation recommends flattening your data. More specifically it recommends against mixing types of data, like you are doing with user metadata and device metadata.
The recommendation would be to split them into two top-level nodes:
/users
    <userid1>
        email:
        id:
        name:
    <userid2>
        email:
        id:
        name:
/devices
    <userid>
        <deviceid1>
            <measurement1>
            <measurement2>
        <deviceid2>
            <measurement1>
            <measurement2>
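If it helps to see that layout in code, here is a rough sketch using the Firebase Admin SDK for Python; the credential path, database URL, IDs, and values are placeholders. (With AngularFire the client API differs, but the paths are the same.)

    import firebase_admin
    from firebase_admin import credentials, db

    # Placeholder credentials and database URL -- substitute your own project's values.
    cred = credentials.Certificate("path/to/serviceAccountKey.json")
    firebase_admin.initialize_app(cred, {"databaseURL": "https://your-project.firebaseio.com"})

    uid = "userid1"

    # User metadata lives under /users/<uid> ...
    db.reference("users/" + uid).set({
        "name": "Jane Doe",
        "email": "jane@example.com",
    })

    # ... while device readings live under a separate top-level /devices node,
    # so loading a user never pulls down every reading of every device.
    db.reference("devices/" + uid + "/deviceid1").set({
        "measurement1": 20.5,
        "measurement2": 21.1,
    })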
Further recommended reading is NoSQL data modeling, and viewing our Firebase for SQL developers series. Oh, and of course, the Firebase documentation on structuring data.
I am currently working on a web-application that would allow users to analyze & visualize data. For example, one of the use-cases is that the user will perform a Principal Component Analysis and store it. There can be other such analysis like a volcano plot, heatmap etc.
I would like to store these analysis and visualizations in a database in the back-end. The challenge that I am facing is how to design a relational database schema which will do this efficiently. Here are some of my concerns:
The data associated with the project will already be stored in a normalized manner so that it can be recalled. I would not like to store it again with the visualization.
At the same time, the user should be able to see what is the original data behind a visualization. For eg. what data was fed to a PCA algorithm? The user might not use all the data associated with the project for the PCA. He/she could just be doing this on a subset of the data in the project.
The number of visualizations associated with the web app will grow with time. If I need to design an involved schema every time a new visualization is added, it could make overall development slower.
With these in mind, I am wondering if I should try to solve this with a relational database like MySQL at all. Or should I look at MongoDB? More generally, how do I think about this problem? I tried looking for some blogs/tutorials online but couldn't find much that was useful.
The first step, before thinking about technical design (including the choice of a relational or non-SQL platform), is a data model that clearly describes the structure of, and relations between, your data in a platform-independent way. I see the following interesting points to solve there:
How is a visualisation related to the data objects it visualizes? When the visualisation just displays the data of one object type (let's say the number of sales per month), this is trivial. But if it covers more than one object type (the number of sales per month, product category, and country), you will have to decide to which of them to link it. There is no single correct solution for this, but it depends on the requirements from the users' view: From which origins will they come to find this visualisation? If they always come from the same origin (let's say the country), it will be enough to link the visuals to that object type.
How will you handle insertions, deletes, and updates of the basic data since the point in time the visualisation has been generated? If no such operations relevant to the visuals are possible, then it's easy: Just store the selection criteria (country = "Austria", product category = "Toys") with the visual, and everyone will know its meaning. If, however, the basic data can be changed, you should implement a data model that covers historizing those data, i.e. being able to reconstruct the data values on which the original visual was based. Of course, before deciding on this, you need to clarify the requirements: Will, in case of changed basic data, the original visual still be of interest or will it need to be re-generated to reflect the changes?
Both questions are neither simplified nor complicated by using a NoSQL database.
No matter what the outcome of those requirements and data modeling efforts are, I would stick to the following principles:
Separate the visuals from the basic data, even if a visual is closely related to just one set of basic data. Reason: The visuals are just a consequence of the basic data that can be re-calculated in case they get lost. So the requirements e.g. for data backup will be more strict for the basic data than for the visuals.
Don't store basic data redundantly to show the basis for each single visual. A timestamp logic with each record of basic data, together with the timestamp of the generated visual, will serve the same purpose with less effort and storage volume.
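As one rough illustration of those two principles (a sketch only; all type names, fields, and example values are invented), the visual's record can carry just its kind, its selection criteria, and a generation timestamp, while the basic data rows carry their own validity timestamps:

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Any, Dict, List, Optional

    @dataclass
    class SalesFact:
        # Basic data record with a simple "valid from/to" historisation.
        country: str
        product_category: str
        month: str
        amount: float
        valid_from: datetime
        valid_to: Optional[datetime] = None   # None = still current

    @dataclass
    class Visualization:
        # The visual stores how it was produced, not a copy of the data it shows.
        kind: str                       # e.g. "PCA", "heatmap", "volcano plot"
        selection: Dict[str, Any]       # e.g. {"country": "Austria", "product_category": "Toys"}
        generated_at: datetime = field(default_factory=datetime.utcnow)
        rendered_output: bytes = b""    # the stored image / plot specification

    def rows_behind(visual: Visualization, facts: List[SalesFact]) -> List[SalesFact]:
        # Reconstruct the basic data the visual saw: match its selection criteria and
        # keep only rows that were valid at the moment the visual was generated.
        return [
            f for f in facts
            if all(getattr(f, key) == value for key, value in visual.selection.items())
            and f.valid_from <= visual.generated_at
            and (f.valid_to is None or f.valid_to > visual.generated_at)
        ]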
I have seen quite a few products with graph-related data models built on both Neo4j and a relational or document database. The other db is generally used to store the metadata of each node.
I am considering building a product relying entirely on Neo4j, storing all my objects' metadata as node properties. Is there any caveat in doing so?
Entirely depends on how much metadata you want to store. 10 primitive / short String properties per node is absolutely fine; 1000 large JSON documents per node... not so much. It isn't a document store.
What sort of numbers are we talking about? I would suggest you generate a random graph with a similar number of properties and similar values to those you wish to have in your product, and see how it performs.
Otherwise no caveats I would say. Oh, don't refer to internal Neo4j node IDs anywhere; unlike in a relational database, these get re-used.
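If you want to try the random-graph suggestion above, a minimal sketch with the official Neo4j Python driver could look like this; the connection details, label, node count, and property sizes are placeholders to adjust to your own numbers:

    import random
    import string
    from neo4j import GraphDatabase

    # Placeholder connection details.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def random_props(n_props, value_len):
        # Build a property map roughly shaped like your real metadata.
        return {
            "prop_%d" % i: "".join(random.choices(string.ascii_letters, k=value_len))
            for i in range(n_props)
        }

    with driver.session() as session:
        for _ in range(10_000):   # number of nodes to benchmark with
            session.run(
                "CREATE (n:BenchmarkNode) SET n += $props",
                props=random_props(n_props=10, value_len=20),
            )

    driver.close()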
I've come across a couple of ORMs that recently announced they are planning to move their implementation from Active Record to Data Mapper. My knowledge of this subject is very limited. So a question for those who know better: is Data Mapper newer than Active Record? Was it around when the Active Record movement started? How do the two relate to each other?
Lastly since I'm not a database person and know little about this subject, should I follow an ORM that's moving to the Data Mapper implementation, as in what's in it for me as someone writing software (not a data person)?
The Data Mapper is not more modern or newer; it is just better suited for an ORM.
The main reason people change is because ActiveRecord does not make for a good ORM. An AR wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data. So by definition, an AR is a 1:1 representation of a database record, which makes it particularly suited for simple CRUD.
Some ARs added fetching of related data, which made people believe AR is an ORM. It is not. The point of an ORM is to tackle the object-relational impedance mismatch between your database structure and your domain objects. When using AR, you don't solve this impedance mismatch, because your AR represents a database row and not a proper OO design. You are tying your DB layout to your objects. Some of the object-relational behavioral patterns can still be applied, though (for instance lazy loading).
Another reason why AR is often criticised is that it intermingles two concerns: business logic and DB access logic. This leads to unwanted coupling and can result in less maintainability and flexibility in larger applications. There is no isolation between the two layers, and coupling always leads to less flexibility.
A DataMapper on the other hand moves data between objects and a database while keeping them independent of each other and the mapper itself. While more difficult to implement, it allows for much more flexible design in your application. Your domain objects no longer have to match the db structure. DAL and Domain layer are decoupled.
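To make the distinction concrete, here is a deliberately small, library-free sketch (Python with sqlite3; the class and table names are invented for illustration): the Active Record object persists itself, while with a Data Mapper the domain object stays ignorant of the database.

    import sqlite3

    # --- Active Record style: the domain object knows about the database ---
    class UserRecord:
        def __init__(self, user_id, email):
            self.user_id = user_id
            self.email = email

        def save(self, conn):
            # Persistence logic lives inside the domain object (tight coupling).
            conn.execute(
                "INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
                (self.user_id, self.email),
            )

    # --- Data Mapper style: the domain object is plain; a mapper moves the data ---
    class User:
        def __init__(self, user_id, email):
            self.user_id = user_id
            self.email = email

    class UserMapper:
        def __init__(self, conn):
            self.conn = conn

        def insert(self, user):
            self.conn.execute(
                "INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
                (user.user_id, user.email),
            )

        def find(self, user_id):
            row = self.conn.execute(
                "SELECT id, email FROM users WHERE id = ?", (user_id,)
            ).fetchone()
            return User(*row) if row else None

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    UserRecord(1, "a@example.com").save(conn)   # AR: the object persists itself
    mapper = UserMapper(conn)
    mapper.insert(User(2, "b@example.com"))     # DM: the mapper persists the object
    print(mapper.find(2).email)

In the second half, only UserMapper would need to change if the table layout changed; the User class and everything that uses it stay untouched.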
Even though the post is 8 years old, the question is still valid in 2018.
Active Record is an anti-pattern; beware of that. It creates a very tight coupling between code and database. It might not be a problem for small, simple projects. However, I would strongly recommend avoiding it in anything bigger.
A good OOP design is done in layers: input layer, service layer, repository layer, data mapper, and DB - just a simple example. You should not mix the input layer with the DB. How can that happen? For example, in Laravel, you can use a Validator rule like this:
'email' => 'exists:staff,email'
It checks whether the email exists in the table staff.
This is complete OOP nonsense. It ties your top layer to a DB table and column name. I cannot imagine a better example of bad OOP design.
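As one hedged illustration of the alternative (a generic Python sketch, not Laravel; the interface and method names are invented), the input layer can depend on a small repository abstraction instead of a table or column name:

    from abc import ABC, abstractmethod
    from typing import List

    class StaffRepository(ABC):
        # The only thing the input layer knows about staff persistence.
        @abstractmethod
        def email_exists(self, email: str) -> bool:
            ...

    class SqliteStaffRepository(StaffRepository):
        def __init__(self, conn):
            self.conn = conn

        def email_exists(self, email: str) -> bool:
            # Table and column names stay hidden down here, in one place.
            row = self.conn.execute(
                "SELECT 1 FROM staff WHERE email = ? LIMIT 1", (email,)
            ).fetchone()
            return row is not None

    def validate_signup(form: dict, staff: StaffRepository) -> List[str]:
        # Input-layer validation: no table or column names in sight.
        errors = []
        if not staff.email_exists(form.get("email", "")):
            errors.append("email: unknown staff member")
        return errors

Swapping the storage (or renaming the column) then touches only the repository implementation, not the validation code.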
The bottom line - if you are creating a simple site with 2-3 tables, like a blog, Active Record might not be a problem. For anything bigger, go for Data Mapper and be careful about OOP principles such as IoC, SoC, etc.
Disclaimer: let me know if this question is better suited for serverfault.com
I want to store information on music, specifically:
genres
artists
albums
songs
This information will be used in a web application, and I want people to be able to see all of the songs associated to an album, and albums associated to an artist, and artists associated to a genre.
I'm currently using MySQL, but before I make a decision to switch I want to know:
How easy is scaling horizontally?
Is it easier to manage than an SQL-based solution?
Would the above data I want to store be too hard to do schema-free?
When I think association, I immediately think RDBMSs; can data be stored in something like CouchDB but still have some kind of association as stated above?
My web application requires replication, how well does CouchDB or others handle this?
Your data seems ideal for document oriented databases.
Document example:
{
  "type": "Album",
  "artist": "ArtistName",
  "album_name": "AlbumName",
  "songs": [
    {"title": "SongTitle", "duration": 4.5}
  ],
  "genres": ["rock", "indie"]
}
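A document like that can be created and read back over CouchDB's plain HTTP API; here is a rough sketch using the Python requests library, where the server address, credentials, database name, and document id are all placeholders:

    import requests

    # Placeholder server and database name.
    couch = "http://admin:password@localhost:5984"
    db = couch + "/music"

    # Create the database (a conflict response just means it already exists).
    requests.put(db)

    album = {
        "type": "Album",
        "artist": "ArtistName",
        "album_name": "AlbumName",
        "songs": [{"title": "SongTitle", "duration": 4.5}],
        "genres": ["rock", "indie"],
    }

    # PUT the document under an explicit _id, then read it back.
    requests.put(db + "/artistname-albumname", json=album).raise_for_status()
    print(requests.get(db + "/artistname-albumname").json())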
And replication is one of CouchDB's coolest features (http://blog.couch.io/post/468392274/whats-new-in-apache-couchdb-0-11-part-three-new).
You might also want to take a look at Riak.
This kind of information is ideally suited to document databases. As with much real-world data, it is not inherently relational, so shoe-horning it into a relational schema will bring headaches down the line (even using an ORM - I speak from experience). Ubuntu already uses CouchDB for storing music metadata, as well as other things, in their One product.
Taking the remainder of your questions one-by-one:
Horizontal scaling is WAY easier than with RDBMS. This is one of the many reasons big sites like Facebook, Digg and LinkedIn are using, or are actively investigating, schema-less databases. For example, sharding (dividing your data across different nodes in a system) works beautifully thanks to a concept called Eventual Consistency; i.e., the data may be inconsistent across nodes for a while, but it will eventually resolve to a consistent state.
It depends what you mean by "manage"... Installation is generally quick and easy to complete. There are no user accounts to configure and secure (this is instead generally done in the application's business logic layer). Working with a document DB in real time can be interesting: there's no ad hoc querying in CouchDB, for example; you have to use the Futon UI or communicate with it via HTTP requests. MongoDB, however, does support ad hoc querying.
I shouldn't think so. Bastien's answer provides a good example of a JSON document serialising some data. The beauty of schemaless DBs is that fields can be missing from one document and present in another, or the documents can be completely different from one another. This sidesteps the many and varied problems that come with NULL values in an RDBMS.
Yes; the associations are stored as nested documents, which are parsed in your application as object references, collections, etc. In Bastien's answer, the "songs" key identifies an array of song documents.
This is very similar to your first question about horizontal scaling (horizontal scaling and replication are intertwined). As the CouchIO blog post Bastien mentioned states, "Replication … has been baked into CouchDB from the beginning." My understanding is that all document databases handle replication well, and do so more easily than replication can be set up in an RDBMS.
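For what it's worth, kicking off a one-shot replication in CouchDB is a single HTTP call to the _replicate endpoint; a small sketch, again with requests and placeholder server names and databases:

    import requests

    couch = "http://admin:password@localhost:5984"   # placeholder server

    # One-shot replication from a local database to a remote one.
    resp = requests.post(
        couch + "/_replicate",
        json={
            "source": "music",
            "target": "http://admin:password@backup-server:5984/music",
            "create_target": True,   # create the target database if it is missing
        },
    )
    resp.raise_for_status()
    print(resp.json())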
Were you to decide you wanted to store the song file itself along with the metadata, you could do that too in CouchDB, by supplying the song file as an attachment to the document; furthermore, you wouldn't have any schema inconsistencies as a result of doing this, because there is no schema!
I hope I haven't made too many missteps here; I'm quite new to document DBs myself.