Mongo vs MySql Search Optimization - mysql

So I'm in the process of designing a system that is going to store document-type data (i.e. transcribed documents). Immediately, I thought this would be a great opportunity to leverage a NoSQL implementation like MongoDB. However, given that I have zero experience with Mongo, I'm wondering: on each of these documents, I have a number of metadata tags I want to be able to search across: things like date, author, keywords, etc. If I were to use an RDBMS like MySQL, I'd probably store these items in a separate table linked by a foreign key and then index the items most likely to be searched on. Then I could run queries against that table and pull back the full-text results only for the items that matched (which saves on disk reads by not having to read through rows that contain a lot of text or BLOB data).
Would something similar be possible with Mongo? I know that in Mongo I could simply create one document that has all the metadata AND the actual transcription, but is it easy and highly performant to search the various fields in the metadata if the document is stored like that? Is there a best practice for searching across various items in a document in Mongo? Or is this type of scenario more suited to an RDBMS than to a NoSQL implementation?
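For reference, the relational layout described above might look roughly like this (table, column and index names are purely illustrative):

-- Full text lives in its own table...
CREATE TABLE documents (
    doc_id        INT PRIMARY KEY,
    transcription LONGTEXT
);

-- ...while the searchable metadata sits in a narrow, indexed table
CREATE TABLE document_metadata (
    doc_id   INT,
    author   VARCHAR(100),
    doc_date DATE,
    keyword  VARCHAR(100),
    FOREIGN KEY (doc_id) REFERENCES documents (doc_id),
    INDEX idx_meta_author (author),
    INDEX idx_meta_date (doc_date)
);

-- Metadata queries never have to touch the wide text rows
SELECT doc_id FROM document_metadata WHERE author = 'Smith' AND doc_date > '2013-01-01';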

You can add indexes on individual fields in MongoDB documents. Performance of index-based searches only becomes a problem once the indexes grow larger than your available memory. For example:
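A minimal sketch (collection and field names are hypothetical) of indexing the metadata fields of such a combined document and querying them without returning the transcription text:

// One document holds both the metadata and the (potentially large) transcription
db.transcripts.insertOne({
    author: "Jane Doe",
    date: ISODate("2014-03-01"),
    keywords: ["history", "letters"],
    text: "...full transcription text..."
})

// Index only the metadata fields that will be searched on
db.transcripts.createIndex({ author: 1, date: 1 })
db.transcripts.createIndex({ keywords: 1 })

// Query by metadata; the projection keeps the large text field out of the results
db.transcripts.find(
    { author: "Jane Doe", date: { $gte: ISODate("2013-01-01") } },
    { text: 0 }
)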
When deciding whether to go with MongoDB, keep in mind that there is no join operation; joins have to be done in your DB layer or above.
If your primary concern is searching, there is an ElasticSearch river for MongoDB, so you can use ElasticSearch on top of your dataset.

The NoSQL model is geared toward large-scale data storage (in contrast to the relational OLTP model). Yes, you can create indexes and run the queries you want, but instead of having related entities spread across tables, you have one complete entity that holds all the entities that depend on it within itself.
When you have to extract complex reports with many joins from a relational database holding millions of rows, such an operation becomes impractical, because you may end up compromising other applications.
For example:
We have Room and Student entities.
Each room has many students; in the relational model we would write the following:
SELECT * FROM ROOM R
INNER JOIN STUDENT S
    ON S.ID = R.STUDENTID
Imagine doing that with some 20 tables, each holding thousands of rows: the performance will be horrible.
With MongoDB you would do this:
db.sala.find()
and you would get back all the rooms with their students embedded.
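For example, a hypothetical "sala" (room) document with its students embedded might look like this:

// A room ("sala") with its students ("alunos") embedded in the same document
db.sala.insertOne({
    nome: "Room 101",
    alunos: [
        { id: 1, nome: "Alice" },
        { id: 2, nome: "Bob" }
    ]
})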
MongoDB is a database designed to scale horizontally.
You can read:
http://openmymind.net/mongodb.pdf
The site also has an interactive tutorial that uses the book's examples. Very nice.
There is also an online MongoDB shell where you can try out your commands; search for "try mongodb".
Also read about sharding with replica sets. I believe it will open your mind greatly.
You can install Robomongo, which is a graphical interface for tinkering with MongoDB.
http://robomongo.org/

Related

Database optimized for searching in large number of objects with different attributes

I am currently searching for an alternative to our aging MySQL database, which uses an EAV approach. Current projects seem to have outgrown traditional table-oriented database structures, and especially searches in such a database.
I have read about and researched various NoSQL database systems, but I can't find anything that seems to be what I'm looking for. Maybe you can help.
I'll show you a generalized example on what kind of data I have and what operations I want to execute on them:
I have an object that has a small number of META attributes: attributes that are common to all instances of my objects. For example these:
DataObject Common (META) Attributes
Unique ID (Some kind of string containing a unique identifier)
Created Date (A date time showing creation time of the object)
Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video")
... I think you get the Idea
Then each of my Objects has a variable number of other attributes. Most probably, many Objects will share a number of these attributes, but there is no rule. For my sample, let's say each Object instance has between 5 and 20 such attributes. Here are some samples:
Data Object variable Attributes
Color (Some CSS like color string)
Name (A string)
Category (The category or Tag of this item) (Maybe we also have more than one of these?)
URL (a url containing some website)
Cost (a number with decimals)
... And a whole lot of other stuff mostly being of the usual column types
References to other data is an idea, but not a MUST at the moment. I could provide those within my application logic if needed.
A small sample:
Image
Unique ID = "0s987tncsgdfb64s5dxnt"
Created Date = "2013-11-21 12:23:11"
Type = "Image"
Title = "A cute cat"
Category = "Animal"
Size = "10234"
Mime = "image/jpeg"
Filename = "cat_123.jpg"
Copyright = "None"
Typical Operations
An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.
Apart from the usual stuff like writing one object to the database or reading it by its uid, the most problematic operations are these:
Search by several attributes - Select every DataObject that has Type "News", whose Title contains "blue", and whose Created Date is after 2012.
Paged bulk read - Get a large number of objects from a search (see above), starting at element 100 and ending at 250.
Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of its attributes in one call.
Storage Requirements
Persistence - The storage needs to be persistent, not in-memory only. If the server reboots, the data has to be at the same point in time as when it shut down. No memory-only systems.
Integrity - All data is important, nothing can be ignored, so every single write action has to be stored durably. Systems (Redis?) that tend to lose something now and then aren't usable. Systems with heavy asynchronicity are also problematic: if data changes, every responsible node should see it.
Complexity - The system should be fairly easy to set up and maintain, so systems that force the admin to take week-long courses in their use aren't really a solution here. The same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to run a cheap system with one node.
tl;dr
Need a super fast database system with object-oriented data and fast searches, even with hundreds of thousands of items.
A reason as to why I am searching for a better alternative to mysql can be found here: Need MySQL optimization for complex search on EAV structured data
Update
Key-value stores like Redis weren't an option, as we need to do some heavy searching inside our data, something which isn't possible in a typical key-value store.
In the end, we are using MongoDB with a slightly optimized schema to make the best use of MongoDB's indexes.
Some small drawbacks still remain but are acceptable at the moment:
- MongoDB's aggregate function cannot work with very large result sets. We have to use find (and refine our data structure to make that sufficient).
- You cannot sort large datasets on specific values, as it would take up too much memory. You also can't create indexes on those values, as they are schema-free.
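For illustration, a rough sketch (collection and field names are assumed, not the poster's actual schema) of how the typical operations above can be expressed in MongoDB once a compound index over the common META attributes exists:

// Compound index over the common META attributes
db.objects.createIndex({ type: 1, created: 1 })

// "Search by several attributes": Type = "News", Title contains "blue", created after 2012
db.objects.find({
    type: "News",
    title: /blue/,
    created: { $gte: ISODate("2013-01-01") }
})

// "Paged bulk read": elements 100 to 250 of the same search
db.objects.find({ type: "News", title: /blue/, created: { $gte: ISODate("2013-01-01") } })
    .skip(100)
    .limit(150)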
I don't know if you want a more sophisticated answer than mine, but maybe I can inspire you a little.
MySQL is scalable and can be used for exactly your case. I think it's more of an optimization and server problem if your database is slow. Many systems with massive amounts of data use MySQL and work perfectly, though NoSQL (Not-Only SQL) is built for large amounts of data with varying attributes.
There are many different NoSQL providers, and they handle your data in different ways.
Think about that before you choose a NoSql platform.
The possibilities are
Key–value Stores - ex. Redis, Voldemort, Oracle BDB
Column Store - ex. Cassandra, HBase
Document Store - ex. CouchDB, MongoDb
Graph Database - ex. Neo4J, InfoGrid, Infinite Graph
Most websites use document-based storage, but Facebook, for example, uses a column store because of its many dynamic attributes.
You can try the Document based NoSql at http://try.mongodb.org/
In the end, it really depends on how you build and optimize your database, not on which technology you choose, though choosing the right technology can save a bunch of time.
The system we have developed uses a combination of MySQL and NoSQL, depending on what data we are working with: MySQL for the system itself and NoSQL for all the data we import via APIs.
Hope this inspires a little, and feel free to ask any questions.

Storing JSON in database vs. having a new column for each key

I am implementing the following model for storing user related data in my table - I have 2 columns - uid (primary key) and a meta column which stores other data about the user in JSON format.
uid | meta
--------------------------------------------------
1 | {name:['foo'],
| emailid:['foo@bar.com','bar@foo.com']}
--------------------------------------------------
2 | {name:['sann'],
| emailid:['sann@bar.com','sann@foo.com']}
--------------------------------------------------
Is this a better way (performance-wise, design-wise) than the one-column-per-property model, where the table would have many columns like uid, name, emailid?
What I like about the first model is that you can add as many fields as you want; there is no limitation.
Also, I was wondering, now that I have implemented the first model, how do I perform a query on it? For example, I want to fetch all the users who have a name like 'foo'.
Question - Which is the better way to store user-related data (keeping in mind that the number of fields is not fixed): JSON or column-per-field? Also, if the first model is implemented, how do I query the database as described above? Should I use both models, by storing the data that may be searched by a query in separate columns and the rest in JSON?
Update
Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?
Updated 4 June 2017
Given that this question/answer has gained some popularity, I figured it was worth an update.
When this question was originally posted, MySQL had no support for JSON data types and the support in PostgreSQL was in its infancy. Since 5.7, MySQL now supports a JSON data type (in a binary storage format), and PostgreSQL JSONB has matured significantly. Both products provide performant JSON types that can store arbitrary documents, including support for indexing specific keys of the JSON object.
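As a hedged illustration (table, column and key names are invented), indexing a specific JSON key looks roughly like this in the two products:

-- MySQL 5.7+: expose a JSON key as a stored generated column and index it
ALTER TABLE users
    ADD COLUMN name VARCHAR(255)
        GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(meta, '$.name'))) STORED,
    ADD INDEX idx_users_name (name);

-- PostgreSQL: an expression index directly on a JSONB key
CREATE INDEX idx_users_meta_name ON users ((meta ->> 'name'));

-- In the MySQL case, a lookup like this can now use the index:
SELECT uid FROM users WHERE name = 'foo';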
However, I still stand by my original statement that your default preference, when using a relational database, should still be column-per-value. Relational databases are still built on the assumption that the data within them will be fairly well normalized. The query planner has better optimization information when looking at columns than when looking at keys in a JSON document. Foreign keys can be created between columns (but not between keys in JSON documents). Importantly: if the majority of your schema is volatile enough to justify using JSON, you might want to at least consider whether a relational database is the right choice.
That said, few applications are perfectly relational or document-oriented. Most applications have some mix of both. Here are some examples where I personally have found JSON useful in a relational database:
When storing email addresses and phone numbers for a contact, where storing them as values in a JSON array is much easier to manage than multiple separate tables
Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)
Storing configuration data that has no defined schema (if you're building Zapier, or IFTTT and need to store configuration data for each integration)
I'm sure there are others as well, but these are just a few quick examples.
Original Answer
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
Like most things "it depends". It's not right or wrong/good or bad in and of itself to store data in columns or JSON. It depends on what you need to do with it later. What is your predicted way of accessing this data? Will you need to cross reference other data?
Other people have answered pretty well what the technical trade-off are.
Not many people have discussed that your app and features evolve over time and how this data storage decision impacts your team.
One of the temptations of using JSON is to avoid migrating the schema, so if the team is not disciplined, it's very easy to stick yet another key/value pair into a JSON field. There's no migration for it, no one remembers what it's for, and there is no validation on it.
My team used JSON alongside traditional columns in Postgres, and at first it was the best thing since sliced bread. JSON was attractive and powerful, until one day we realized that the flexibility came at a cost and it suddenly became a real pain point. Sometimes that point creeps up really quickly, and then it becomes hard to change because we've built so many other things on top of this design decision.
Over time, adding new features, having the data in JSON led to more complicated-looking queries than we would have had if we had stuck to traditional columns. So then we started fishing certain key values back out into columns so that we could make joins and comparisons between values. Bad idea. Now we had duplication. A new developer would come on board and be confused: which value should I be saving back into, the JSON one or the column?
The JSON fields became junk drawers for little pieces of this and that. No data validation on the database level, no consistency or integrity between documents. That pushed all that responsibility into the app instead of getting hard type and constraint checking from traditional columns.
Looking back, JSON allowed us to iterate very quickly and get something out the door, and it was great. However, after we reached a certain team size, its flexibility also allowed us to hang ourselves with a long rope of technical debt, which then slowed down subsequent feature development. Use with caution.
Think long and hard about what the nature of your data is. It's the foundation of your app. How will the data be used over time? And how is it likely TO CHANGE?
Just tossing it out there, but WordPress has a structure for this kind of stuff (at least WordPress was the first place I observed it, it probably originated elsewhere).
It allows limitless keys, and is faster to search than using a JSON blob, but not as fast as some of the NoSQL solutions.
uid | meta_key | meta_val
----------------------------------
1 name Frank
1 age 12
2 name Jeremiah
3 fav_food pizza
.................
EDIT
For storing history/multiple keys
uid | meta_id | meta_key | meta_val
----------------------------------------------------
1 1 name Frank
1 2 name John
1 3 age 12
2 4 name Jeremiah
3 5 fav_food pizza
.................
and query via something like this:
select meta_val from `table` where meta_key = 'name' and uid = 1 order by meta_id desc
The drawback of the approach is exactly what you mentioned:
it makes it VERY slow to find things, since each time you need to perform a text search on the JSON blob,
whereas value-per-column matches against the whole string directly.
Your approach (JSON-based data) is fine for data you don't need to search by and just need to display along with your normal data.
Edit: Just to clarify, the above goes for classic relational databases. NoSQL document databases use JSON internally, and are probably a better option if that is the desired behavior.
Basically, the first model you are using is called document-based storage. You should have a look at popular NoSQL document databases like MongoDB and CouchDB. Basically, in document-based DBs you store data as JSON documents, and then you can query those documents.
The second model is the popular relational database structure.
If you want to use a relational database like MySQL, then I would suggest using only the second model. There is no point in using MySQL and storing data as in the first model.
To answer your second question: there is no efficient way to query name like 'foo' if you use the first model.
It seems that you're mainly hesitating whether to use a relational model or not.
As it stands, your example would fit a relational model reasonably well, but the problem may come of course when you need to make this model evolve.
If you only have one (or a few pre-determined) levels of attributes for your main entity (user), you could still use an Entity Attribute Value (EAV) model in a relational database. (This also has its pros and cons.)
If you anticipate that you'll get less structured values that you'll want to search using your application, MySQL might not be the best choice here.
If you were using PostgreSQL, you could potentially get the best of both worlds. (This really depends on the actual structure of the data here... MySQL isn't necessarily the wrong choice either, and the NoSQL options can be of interest, I'm just suggesting alternatives.)
Indeed, PostgreSQL can build indexes on (immutable) functions (which MySQL can't, as far as I know), and in recent versions you could use PLV8 on the JSON data directly to build indexes on specific JSON elements of interest, which would improve the speed of your queries when searching for that data.
EDIT:
Since there won't be too many columns on which I need to perform
search, is it wise to use both the models? Key-per-column for the data
I need to search and JSON for others (in the same MySQL database)?
Mixing the two models isn't necessarily wrong (assuming the extra space is negligible), but it may cause problems if you don't make sure the two data sets are kept in sync: your application must never change one without also updating the other.
A good way to achieve this would be to have a trigger perform the automatic update, by running a stored procedure within the database server whenever an update or insert is made. As far as I'm aware, MySQL's stored procedure language probably lacks support for any sort of JSON processing. Again, PostgreSQL with PLV8 support (and possibly other RDBMSs with more flexible stored procedure languages) should be more useful (updating your relational column automatically using a trigger is quite similar to updating an index in the same way).
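As a rough sketch of that idea in PostgreSQL (assuming a meta JSON column and a plain name column, and using plain PL/pgSQL with the built-in JSON operators rather than PLV8):

-- Keep a searchable "name" column in sync with the JSON blob automatically
CREATE OR REPLACE FUNCTION sync_name_from_meta() RETURNS trigger AS $$
BEGIN
    NEW.name := NEW.meta ->> 'name';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_name
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE PROCEDURE sync_name_from_meta();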
Short answer:
you have to mix the two.
Use JSON for data that you are not going to create relations with, like contact data, addresses, or product variables.
Sometimes joins on tables are an overhead, say for OLAP. If I have two tables, ORDERS and ORDER_DETAILS, then to get all the order details we have to join the two tables, and this makes the query slower as the number of rows grows into the millions; a left/right join is slower still than an inner join.
I think that if we store a JSON string/object in the respective ORDERS entry, the JOIN can be avoided and report generation will be faster.
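A small sketch of that idea with a MySQL 5.7+ JSON column (table and column names are made up):

-- Line items embedded as JSON in the order row itself, so no join is needed
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer VARCHAR(100),
    details  JSON
);

INSERT INTO orders VALUES
    (1, 'Acme Corp', '[{"item": "Widget", "qty": 2, "price": 9.99}]');

-- The whole order, including its details, comes back from a single-table read
SELECT order_id, customer, details FROM orders WHERE order_id = 1;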
You are trying to fit a non-relational model into a relational database. I think you would be better served using a NoSQL database such as MongoDB. There is no predefined schema, which fits in with your requirement of having no limit on the number of fields (see the typical MongoDB collection example). Check out the MongoDB documentation to get an idea of how you'd query your documents, e.g.
db.mycollection.find(
    {
        name: 'sann'
    }
)
As others have pointed out, queries will be slower. I'd suggest adding at least an '_ID' column and querying by that instead.

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be covered in all aspects. There are certainly specifics that may make one system superior to the other for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example of several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, like an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis / stemming / full-text search scoring / fuzziness functions, things you just cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and its scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store and process relations. Well, you can of course index the keys of other documents inside a multi-valued field of some document, so you can actually store 1:n relations and do it both ways to get n:n, but it is data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors of products and search only for parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can almost not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40000 hits, you can show that you would get only 3 hits if you refined your search to the combination of this field having this value and that field having that value. Stuff that needs additional queries in MySQL is done efficiently and conveniently.
So let's sum up
The power of Lucene is text searching/analyzing. It is also mind-blowingly fast because of the inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document-oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, like very complex, multi-dimensional range filter queries, to do otherwise.
If you do not have text search, but rather something where users point and click rather than enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
You need real full-text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's biggest real-estate online classifieds portals, which was using Solr for search (and was previously using MySQL). The migration reduced the search times manifold.
Solr can be easily integrated with MySQL:
Solr full data import can be used for importing data from MySQL tables into Solr collections.
Solr delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.

What is the difference between a Relational and Non-Relational Database?

MySQL, PostgreSQL and MS SQL Server are relational database systems, while MongoDB and other NoSQL stores are non-relational DBMSs.
What are the differences between the two types of system?
Hmm, not quite sure what your question is.
In the title you ask about Databases (DB), whereas in the body of your text you ask about Database Management Systems (DBMS). The two are completely different and require different answers.
A DBMS is a tool that allows you to access a DB.
Other than the data itself, a DB is the concept of how that data is structured.
So just as you can program with an object-oriented methodology using a non-OO compiler, or vice versa, you can set up a relational database without an RDBMS, or use an RDBMS to store non-relational data.
I'll focus on what Relational Database (RDB) means and leave the discussion about what systems do to others.
A relational database (the concept) is a data structure that allows you to link information from different 'tables', or different types of data buckets. A data bucket must contain what is called a key or index (which allows you to uniquely identify any atomic chunk of data within the bucket). Other data buckets may refer to that key so as to create a link between their data atoms and the atom pointed to by the key.
A non-relational database just stores data without explicit and structured mechanisms to link data from different buckets to one another.
As to implementing such a scheme, if you have a paper file with an index and in a different paper file you refer to the index to get at the relevant information, then you have implemented a relational database, albeit quite a simple one. So you see that you do not even need a computer (of course it can become tedious very quickly without one to help), similarly you do not need an RDBMS, though arguably an RDBMS is the right tool for the job. That said there are variations as to what the different tools out there can do so choosing the right tool for the job may not be all that straightforward.
I hope this is layman terms enough and is helpful to your understanding.
Relational databases have a mathematical basis (set theory, relational theory), which are distilled into SQL == Structured Query Language.
NoSQL's many forms (e.g. document-based, graph-based, object-based, key-value store, etc.) may or may not be based on a single underpinning mathematical theory. As S. Lott has correctly pointed out, hierarchical data stores do indeed have a mathematical basis. The same might be said for graph databases.
I'm not aware of a universal query language for NoSQL databases.
Most of what you "know" is wrong.
First of all, as a few of the relational gurus routinely (and sometimes stridently) point out, SQL doesn't really fit nearly as closely with relational theory as many people think. Second, most of the differences in "NoSQL" stuff has relatively little to do with whether it's relational or not. Finally, it's pretty difficult to say how "NoSQL" differs from SQL because both represent a pretty wide range of possibilities.
The one major difference that you can count on is that almost anything that supports SQL supports things like triggers in the database itself -- i.e. you can design rules into the database proper that are intended to ensure that the data is always internally consistent. For example, you can set things up so your database asserts that a person must have an address. If you do so, anytime you add a person, it will basically force you to associate that person with some address. You might add a new address or you might associate them with some existing address, but one way or another, the person must have an address. Likewise, if you delete an address, it'll force you to either remove all the people currently at that address, or associate each with some other address. You can do the same for other relationships, such as saying every person must have a mother, every office must have a phone number, etc.
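A minimal sketch of how the person/address rule might be declared (table and column names are invented for illustration):

CREATE TABLE address (
    address_id INT PRIMARY KEY,
    street     VARCHAR(200) NOT NULL
);

CREATE TABLE person (
    person_id  INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    address_id INT NOT NULL,  -- every person must have an address
    FOREIGN KEY (address_id) REFERENCES address (address_id)
);

-- Inserting a person with a missing or unknown address_id now fails, and an
-- address cannot be deleted while a person still references it.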
Note that these sorts of things are also guaranteed to happen atomically, so if somebody else looks at the database as you're adding the person, they'll either not see the person at all, or else they'll see the person with the address (or the mother, etc.)
Most of the NoSQL databases do not attempt to provide this kind of enforcement in the database proper. It's up to you, in the code that uses the database, to enforce any relationships necessary for your data. In most cases, it's also possible to see data that's only partially correct, so even if you have a family tree where every person is supposed to be associated with parents, there can be times that whatever constraints you've imposed won't really be enforced. Some will let you do that at will. Others guarantee that it only happens temporarily, though exactly how long it can/will last can be open to question.
The relational database uses a formal system of predicates to address data. The underlying physical implementation is of no substance and can vary to optimize for certain operations, but it must always assume the relational model. In layman's terms, that's just saying: I know exactly how many values (attributes) each row (tuple) in my table (relation) has, and now I want to exploit that fact accordingly, thoroughly and to its extreme. That's the true nature of the beast.
Since we're obviously the generation that has had a relational upbringing, if you look at NoSQL database models from the perspective of the relational model, again in layman's terms, the first obvious difference is that no assumption about the number of values a row can contain is ever made. This is really oversimplifying the matter and does not cleanly apply to the intricacies of the physical models of every NoSQL database, but it's the pinnacle of the relational model and the first assumption we have to leave behind or, if you'd rather, the biggest leap we have to make.
We can agree to two things that are true for every DBMS: it can store any kind of data and has enough mathematical underpinnings to make it possible to manage the data in any way imaginable. The reality is that you'll never want to make the mistake of putting any of the two points to the test, but rather just stick with what the actual DBMS was really made for. In layman's terms: respect the beast within!
(Please note that I've avoided comparing the (obviously) well founded standards revolving around the relational model against the many flavors provided by NoSQL databases. If you'd like, consider NoSQL databases as an umbrella term for any DBMS that does not completely assume the relational model, in exclusion to everything else. The differences are too many, but that's the principal difference and the one I think would be of most use to you to understand the two.)
Let me try to explain this question at a level that refers a little to the technology.
Take MongoDB and traditional SQL for comparison, and imagine the scenario of posting a tweet on Twitter. This tweet contains 9 pictures. How do you store this tweet and its corresponding pictures?
In traditional relational SQL, you can store the tweets and pictures in separate tables, and represent the connection by building a new join table.
What's more, you can set a field of an image type, zip the 9 pictures into a binary document, and store it in that field.
Using MongoDB, you could build a document like this (similar to the concept of a table in relational SQL):
{
    "id": "XXX",
    "user": "XXX",
    "date": "xxxx-xx-xx",
    "content": {
        "text": "XXXX",
        "picture": ["p1.png", "p2.png", "p3.png"]
    }
}
Therefore, in my opinion, the main difference is how you store the data and at what level of the storage the relationships between the data are kept.
In this example, the data is the tweet and the pictures. The different mechanisms for storing the relationship between them also play an important role in the difference between the two.
I hope this small example helps show the difference between SQL and NoSQL (ACID and BASE).
Here's a link to a picture from the Internet about the goals of NoSQL:
http://icamchuwordpress-wordpress.stor.sinaapp.com/uploads/2015/01/dbc795f6f262e9d01fa0ab9b323b2dd1_b.png
The difference between relational and non-relational is exactly that. The relational database architecture provides constraint objects such as primary keys, foreign keys, etc. that allow you to tie two or more tables together in a relation. This is good because we can normalize our tables, which is to say split the information the database represents into many different tables, and thereby keep the integrity of the data.
For example, say you have a series of tables that house information about an employee. You could not delete a record from one table without deleting all the records that pertain to it from the other tables. In this way you implement data integrity. A non-relational database doesn't provide these constraint constructs that allow you to enforce data integrity.
Unless you implement those constraints in the front-end application that populates the database's tables, you are building a mess that can be compared with the wild west.
First, let me start by saying why we need a database.
We need a database to help organise information in such a manner that we can retrieve the stored data in an efficient manner.
Examples of relational database management systems(SQL):
1)Oracle Database
2)SQLite
3)PostgreSQL
4)MySQL
5)Microsoft SQL Server
6)IBM DB2
Examples of non relational database management systems(NoSQL)
1)MongoDB
2)Cassandra
3)Redis
4)Couchbase
5)HBase
6)DocumentDB
7)Neo4j
Relational databases hold normalized data: information is stored in tables as rows and columns, and keeping data in normalized form helps reduce redundancy. The data in the tables is normally related, so when we want to retrieve data we can query it with join statements and get exactly what we need. This is suited to workloads with more writes than reads and not too much data involved, and it is relatively easy to update data in tables compared with non-relational databases. Horizontal scaling is not really possible, while vertical scaling is possible to some extent. They follow CAP (Consistency, Availability, Partition tolerance) and ACID (Atomicity, Consistency, Isolation, Durability) compliance.
Let me show entering data to a relational database using PostgreSQL as an example.
First create a product table as follows:
CREATE TABLE products (
    product_no integer,
    name text,
    price numeric
);
then insert the data
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
Let's look at another different example:
Here, in a relational database, we can link the student table and the subject table using a relationship via a foreign key (subject ID). In a non-relational database there is no need for two documents, as there are no relationships; we store all the subject details and student details in one document, say the student document. The data then gets duplicated, which makes updating records troublesome.
In non-relational databases there is no fixed schema and data is not normalized; no relationships between the data are created, and all the data is mostly put into one document. They are well suited to handling lots of data and can transfer lots of data at once, and they work best with high amounts of reads and fewer writes and updates; querying the data is a bit more difficult, as there is no fixed schema. Horizontal and vertical scaling are possible. They aim for CAP (Consistency, Availability, Partition tolerance) and BASE (Basically Available, Soft state, Eventually consistent) compliance.
Let me show an example of entering data into a non-relational database, using MongoDB:
db.users.insertOne({ name: 'Mary', age: 28, occupation: 'writer' })
db.users.insertOne({ name: 'Ben', age: 21 })
Hence you can understand that db is the database, users is a collection within it, and insertOne is the method with which we add a document. There is no fixed schema, as our first record has 3 attributes and the second record has only 2; this is no problem in non-relational databases, but it cannot be done in relational databases, as relational databases have a fixed schema.
Let's look at another different example
({ Studname: 'Ash', Subname: 'Mathematics', LecturerName: 'Mr. Oak' })
Hence we can see that in a non-relational database we can enter both student details and subject details into one document, as no relationships are defined in non-relational databases; but storing data this way can lead to data duplication, and hence to errors when updating.
Hope this explains everything
In layman's terms it's strongly structured vs unstructured, which implies that you have different degrees of adaptability for your DB.
Differences arise particularly in indexing, as you need to ensure that a certain reference index can link to another item -> this is a relation. The stricter structure of a relational DB comes from this requirement.
Note that NosDB apparently provides both relational and non-relational DBs and a way to query both: http://www.alachisoft.com/nosdb/sql-cheat-sheet.html

NOsql Vs Mysql - Going schemaless with Cassandra

Here are the facts:
We have a lot (L O T) of data coming in everyday.
Each file we receive is in a CSV format, and while there are a couple of headers that reoccur more often than others, there is not really a standard.
Normalizing each file to be uploaded into a MySQL database is highly time consuming and often pushes us to change the schema (a new field appears in one file that did not exist before...).
While the primary key is unique, anything else can be duplicated
These are customers records (i.e.: email,firstname,lastname,city,state,address...etc)
We could have multiple emails for the same individual ..
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
Speed is what we are looking for. MySQL is too slow to answer queries where tables have over 50 million records. Even well optimized, we have too many speed issues. Breaking down the tables has become an organizational concern. Schemaless NoSQL seems attractive. What would you recommend, and what did you implement? (Please do not answer "optimize MySQL"; that is pointless and off topic.)
--
Let's go over the points:
We have a lot (L O T) of data coming in everyday.
NoSQL solutions are basically all created to scale to large numbers (Riak, MongoDB, Cassandra, etc.)
... headers that reoccur more often than others, there is not really a standard... The normalization of each file to be uploaded into a mySQL database is highly time consuming and often pushes us to change the schema
NoSQL definitely fits this model; many NoSQL stores are "schema-less", so it's easy to store those extra fields. This will however cost you extra space, as the field names are typically stored with each document.
While the primary key is unique, anything else can be duplicated
"Document-oriented" and "Key-Value" databases are a good fit for this as long as the key is provided. If you have to run duplicate checks, then most key-value database are ill-equipped. The "document-oriented" database might be slightly better equipped, but not by much.
We could have multiple emails for the same individual
Most of these databases have some notion of "arrays as a basic type". CouchDB and MongoDB both store objects as JSON, so it's easy to see how a customer could have an array of e-mails without the need for a "join table". MongoDB also provides "atomic update" features like $addToSet that play nicely with arrays.
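For instance, a small sketch (collection and field names assumed) of a customer document with an e-mail array and an atomic $addToSet update:

// A customer document with an array of e-mail addresses
db.customers.insertOne({
    _id: "cust-123",
    firstname: "Jane",
    emails: ["jane@example.com"]
})

// $addToSet appends atomically and skips values that are already present
db.customers.updateOne(
    { _id: "cust-123" },
    { $addToSet: { emails: "jane.doe@work.example.com" } }
)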
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
The major NoSQL DBs are all designed to scale. (both reads and writes)
The only way to availability is through hardware and locational redundancy (no different from MySQL or other databases). Despite their low version numbers, many of these databases are being used in production environments by very big companies, so many of the simple cases are covered. It's still virgin territory, but we're past the "randomly crashes when nothing has changed" phase.
Speed is what we are looking for... Schema less noSQL seemed attractive. What would you recommend, what did you implement?
We have hundreds of millions of flexible user records in MongoDB. Performance on individual seeks is really awesome.
However, you have to be wary about the type of queries you're running.
If you need to run queries that bring back several users at once, you're going to have speed issues with basically any of these key-value or document-oriented databases. You may want to look at a graph database or some other fancy solution. However, if your use cases all center around one user at a time, then take a look at MongoDB.
MongoDB also supports native map-reduce, so you'll be able to scale "non-real time" queries.
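For example, a small sketch (collection and field names assumed) of the kind of "non-real time" aggregation meant here, counting customer records per state with classic map-reduce:

// Classic MongoDB map-reduce: count customer records per state
db.customers.mapReduce(
    function () { emit(this.state, 1); },
    function (key, values) { return Array.sum(values); },
    { out: "customers_per_state" }
)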