Related
This is a question that has probably been asked before, but I'm having some difficulty to find exactly my case, so I'll explain my situation in search for some feedback:
I have an application that will be registering locations, I have several types of locations, each location type has a different set of attributes, but I need to associate notes to locations regardless of their type and also other types of content (mostly multimedia entries and comments) to said notes. With this in mind, I came up with a couple of solutions:
Create a table for each location type, and a "notes" table for every location table with a foreign key, this is pretty troublesome because I would have to create a multimedia and comments table for every comments table, e.g.:
LocationTypeA
ID
Attr1
Attr2
LocationTypeA_Notes
ID
Attr1
...
LocationTypeA_fk
LocationTypeA_Notes_Multimedia
ID
Attr1
...
LocationTypeA_Notes_fk
And so on, this would be quite annoying to do, but after it's done, developing on this structure should not be so troublesome.
Create a table with a unique identifier for the location and point content there, like so:
Location
ID
LocationTypeA
ID
Attr1
Attr2
Location_fk
Notes
ID
Attr1
...
Location_fk
Multimedia
ID
Attr1
...
Notes_fk
As you see, this is far more simple and also easier to develop, but I just don't like the looks of that table with only IDs (yeah, that's truly the only objection I have to this, it's the option I like the most, to be honest).
Similar to option 2, but I would have an enormous table of attributes shaped like this:
Location
ID
Type
Attribute
Name
Value
And so on, or a table for each attribute; a la Drupal. This would be a pain to develop because then it would take several insert/update operations to do something on a location and the Attribute table would be several times bigger than the location table (or end up with an enormous amount of attribute tables); it also has the same issue of the surrogate-keys-only table (just it has a "type" now, which I would use to define the behavior of the location programmatically), but it's a pretty solution.
So, to the question: which would be a better solution performance and scalability-wise?, which would you go with or which alternatives would you propose? I don't have a problem implementing any of these, options 2 and 3 would be an interesting development, I've never done something like that, but I don't want to go with an option that will collapse on itself when the content grows a bit; you're probably thinking "why not just use Drupal if you know it works like you expect it to?", and I'm thinking "you obviously don't know how difficult it is to use Drupal, either that or you're an expert, which I'm most definitely not".
Also, now that I've written all of this, do you think option 2 is a good idea overall?, do you know of a better way to group entities / simulate inheritance? (please, don't say "just use inheritance!", I'm restricted to using MySQL).
Thanks for your feedback, I'm sorry if I wrote too much and meant too little.
ORM systems usually use the following, mostly the same solutions as you listed there:
One table per hierarchy
Pros:
Simple approach.
Easy to add new classes, you just need to add new columns for the additional data.
Supports polymorphism by simply changing the type of the row.
Data access is fast because the data is in one table.
Ad-hoc reporting is very easy because all of the data is found in one table.
Cons:
Coupling within the class hierarchy is increased because all classes are directly coupled to the same table.
A change in one class can affect the table which can then affect the other classes in the hierarchy.
Space potentially wasted in the database.
Indicating the type becomes complex when significant overlap between types exists.
Table can grow quickly for large hierarchies.
When to use:
This is a good strategy for simple and/or shallow class hierarchies where there is little or no overlap between the types within the hierarchy.
One table per concrete class
Pros:
Easy to do ad-hoc reporting as all the data you need about a single class is stored in only one table.
Good performance to access a single object’s data.
Cons:
When you modify a class you need to modify its table and the table of any of its subclasses. For example if you were to add height and weight to the Person class you would need to add columns to the Customer, Employee, and Executive tables.
Whenever an object changes its role, perhaps you hire one of your customers, you need to copy the data into the appropriate table and assign it a new POID value (or perhaps you could reuse the existing POID value).
It is difficult to support multiple roles and still maintain data integrity. For example, where would you store the name of someone who is both a customer and an employee?
When to use:
When changing types and/or overlap between types is rare.
One table per class
Pros:
Easy to understand because of the one-to-one mapping.
Supports polymorphism very well as you merely have records in the appropriate tables for each type.
Very easy to modify superclasses and add new subclasses as you merely need to modify/add one table.
Data size grows in direct proportion to growth in the number of objects.
Cons:
There are many tables in the database, one for every class (plus tables to maintain relationships).
Potentially takes longer to read and write data using this technique because you need to access multiple tables. This problem can be alleviated if you organize your database intelligently by putting each table within a class hierarchy on different physical disk-drive platters (this assumes that the disk-drive heads all operate independently).
Ad-hoc reporting on your database is difficult, unless you add views to simulate the desired tables.
When to use:
When there is significant overlap between types or when changing types is common.
Generic Schema
Pros:
Works very well when database access is encapsulated by a robust persistence framework.
It can be extended to provide meta data to support a wide range of mappings, including relationship mappings. In short, it is the start at a mapping meta data engine.
It is incredibly flexible, enabling you to quickly change the way that you store objects because you merely need to update the meta data stored in the Class, Inheritance, Attribute, and AttributeType tables accordingly.
Cons:
Very advanced technique that can be difficult to implement at first.
It only works for small amounts of data because you need to access many database rows to build a single object.
You will likely want to build a small administration application to maintain the meta data.
Reporting against this data can be very difficult due to the need to access several rows to obtain the data for a single object.
When to use:
For complex applications that work with small amounts of data, or for applications where you data access isn’t very common or you can pre-load data into caches.
I am working on a project which involves building a social network-style application allowing users to share inventory/product information within their network (for sourcing).
I am a decent programmer, but I am admittedly not an expert with databases; even more so when it comes to database design. Currently, user/company information is stored via a relational database scheme in MySQL which is working perfectly.
My problem is that while my relational scheme works brilliantly for user/company information, it is confusing me on how to implement inventory information. The issue is that each "inventory list" will definitely contain differing attributes specific to the product type, but identical to the attributes of each other product in the list. My first thought was to create a table for each "inventory list". However, I feel like this would be very messy and would complicate future attempts at KDD. I also (briefly) considered using a 'master inventory' and storing the information (e.g. the variable categories and data as a JSON string. But I figured JSON strings MySQL would just become a larger pain in the ass.
My question is essentially how would someone else solve this problem? Or, more generally, sticking with principles of relational database management, what is the "correct" way to associate unique, large data sets of similar type with a parent user? The thing is, I know I could easily jerry-build something that would work, but I am genuinely interested in what the consensus is on how to solve this problem.
Thanks!
I would check out this post: Entity Attribute Value Database vs. strict Relational Model Ecommerce
The way I've always seen this done is to make a base table for inventory that stores universally common fields. A product id, a product name, etc.
Then you have another table that has dynamic attributes. A very popular example of this is Wordpress. If you look at their data model, they use this idea heavily.
One of the good things about this approach is that it's flexible. One of the major negatives is that it's slow and can produce complex code.
I'll throw out an alternative of using a document database. In that case, each document can have a different schema/structure and you can still run queries against them.
I'm currently choosing between two different database designs. One complicated which separates data better then the more simple one. The more complicated design will require more complex queries, while the simpler one will have a couple of null fields.
Consider the examples below:
Complicated:
Simpler:
The above examples are for separating regular users and Facebook users (they will access the same data, eventually, but login differently). On the first example, the data is clearly separated. The second example is way simplier, but will have at least one null field per row. facebookUserId will be null if it's a normal user, while username and password will be null if it's a Facebook-user.
My question is: what's prefered? Pros/cons? Which one is easiest to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other facebook specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade off is that it will take more effort to query, it will take more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
Second option is just the reverse the first. Easier, but how will you handle future authentication methods and what if you need to add some authentication method specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example Postgres has table inheritance that would be great for your example, have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now if you do not have table inheritance you could still create views to simplify your queries, so the "complicated" example is a viable choice here.
Now if you have infinite time than I would go for the first one (for this one simple example and prefered with table inheritance).
However, this is making things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten the hierarchies in the implementation as it had gotten so complex that is was not maintainable anymore.
When you flatten the hierarchy you might consider not using null values, as this can also prove to make things a lot harder (alternatively you can use a -1 or something).
Hope these thoughts help you!
Warning bells are ringing loudly with the presence of two the very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane,
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build it too much kung fu via data structure. Leave that for the application, which is far easier to alter than altering a database!
Let me dare suggest a third. You could introduce 1 (or 2) tables that will cater for extensibility. I personally try to avoid designs that will introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) contain a many-to-one relationship with your users table to cater for multiple/variable user related field.
I'm not sure what your current/short term needs are, but re-engineering your app to cater for maybe, twitter or linkedIn users might be painful. If you can abstract the content of the facebookUserId column into an attribute table like so
user_attr{
id PK
user_id FK
login_id
}
Now, the above definition is ambiguous enough to handle your current needs. If done right, the EAV should look more like this :
user_attr{
id PK
user_id FK
login_id
login_id_type FK
login_id_status //simple boolean flag to set the validity of a given login
}
Where login_id_type will be a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility in that your users can have multiple logins using different external services without you having to change much of your existing system
I hope the title is clear, please read further and I will explain what I mean.
We having a disagreement with our database designer about high level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex - there are dozens of different types of entities (representing a variety of real-world entities, you could think of them as product developers, factories, products, inspections, certifications, etc.) each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records.) This works - we have a basic view in php which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the size of the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that the size of the DB means we should optimize this over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question based on comments I got and more research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and with all the data in one of them, it's an Entity-Attribute-Value database (EAV). These are characterized by:
A very unstructured approach, each fact or data point is just dumped into a big table with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it out. An EAV is optimized for adding data and for organizational flexibility, and the payment is it's slower to access and harder to write queries, etc.
A "long and skinny" format, lots of rows, very few columns.
Because the data is "self encoded“ with its own characteristics, it is often used in situations when you know there will be lots of possible characteristics or data points but that most of them will be empty ("sparse data"). A table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points.
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think that speed of access will be that important for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly I think this structure would be hard for us BD noobs to understand and control, so I am leaning towards the traditional model - sacrificing flexibility and maybe ease of adding new data in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.
MySQL, PostgreSQL and MS SQL Server are relational database systems, and NoSQL, MongoDB, etc. are non-relational DBMSs.
What are the differences between the two types of system?
Hmm, not quite sure what your question is.
In the title you ask about Databases (DB), whereas in the body of your text you ask about Database Management Systems (DBMS). The two are completely different and require different answers.
A DBMS is a tool that allows you to access a DB.
Other than the data itself, a DB is the concept of how that data is structured.
So just like you can program with Oriented Object methodology with a non-OO powered compiler, or vice-versa, so can you set-up a relational database without an RDBMS or use an RDBMS to store non-relational data.
I'll focus on what Relational Database (RDB) means and leave the discussion about what systems do to others.
A relational database (the concept) is a data structure that allows you to link information from different 'tables', or different types of data buckets. A data bucket must contain what is called a key or index (that allows to uniquely identify any atomic chunk of data within the bucket). Other data buckets may refer to that key so as to create a link between their data atoms and the atom pointed to by the key.
A non-relational database just stores data without explicit and structured mechanisms to link data from different buckets to one another.
As to implementing such a scheme, if you have a paper file with an index and in a different paper file you refer to the index to get at the relevant information, then you have implemented a relational database, albeit quite a simple one. So you see that you do not even need a computer (of course it can become tedious very quickly without one to help), similarly you do not need an RDBMS, though arguably an RDBMS is the right tool for the job. That said there are variations as to what the different tools out there can do so choosing the right tool for the job may not be all that straightforward.
I hope this is layman terms enough and is helpful to your understanding.
Relational databases have a mathematical basis (set theory, relational theory), which are distilled into SQL == Structured Query Language.
NoSQL's many forms (e.g. document-based, graph-based, object-based, key-value store, etc.) may or may not be based on a single underpinning mathematical theory. As S. Lott has correctly pointed out, hierarchical data stores do indeed have a mathematical basis. The same might be said for graph databases.
I'm not aware of a universal query language for NoSQL databases.
Most of what you "know" is wrong.
First of all, as a few of the relational gurus routinely (and sometimes stridently) point out, SQL doesn't really fit nearly as closely with relational theory as many people think. Second, most of the differences in "NoSQL" stuff has relatively little to do with whether it's relational or not. Finally, it's pretty difficult to say how "NoSQL" differs from SQL because both represent a pretty wide range of possibilities.
The one major difference that you can count on is that almost anything that supports SQL supports things like triggers in the database itself -- i.e. you can design rules into the database proper that are intended to ensure that the data is always internally consistent. For example, you can set things up so your database asserts that a person must have an address. If you do so, anytime you add a person, it will basically force you to associate that person with some address. You might add a new address or you might associate them with some existing address, but one way or another, the person must have an address. Likewise, if you delete an address, it'll force you to either remove all the people currently at that address, or associate each with some other address. You can do the same for other relationships, such as saying every person must have a mother, every office must have a phone number, etc.
Note that these sorts of things are also guaranteed to happen atomically, so if somebody else looks at the database as you're adding the person, they'll either not see the person at all, or else they'll see the person with the address (or the mother, etc.)
Most of the NoSQL databases do not attempt to provide this kind of enforcement in the database proper. It's up to you, in the code that uses the database, to enforce any relationships necessary for your data. In most cases, it's also possible to see data that's only partially correct, so even if you have a family tree where every person is supposed to be associated with parents, there can be times that whatever constraints you've imposed won't really be enforced. Some will let you do that at will. Others guarantee that it only happens temporarily, though exactly how long it can/will last can be open to question.
The relational database uses a formal system of predicates to address data. The underlying physical implementation is of no substance and can vary to optimize for certain operations, but it must always assume the relational model. In layman's terms, that's just saying I know exactly how many values (attributes) each row (tuple) in my table (relation) has and now I want to exploit the fact accordingly, thoroughly and to it's extreme. That's the true nature of the beast.
Since we're obviously the generation that has had a relational upbringing, if you look at NoSQL database models from the perspective of the relational model, again in layman's terms, the first obvious difference is that no assumptions about the number of values a row can contain is ever made. This is really oversimplifying the matter and does not cleanly apply to the intricacies of the physical models of every NoSQL database, but it's the pinnacle of the relational model and the first assumption we have to leave behind or, if you'd rather, the biggest leap we have to make.
We can agree to two things that are true for every DBMS: it can store any kind of data and has enough mathematical underpinnings to make it possible to manage the data in any way imaginable. The reality is that you'll never want to make the mistake of putting any of the two points to the test, but rather just stick with what the actual DBMS was really made for. In layman's terms: respect the beast within!
(Please note that I've avoided comparing the (obviously) well founded standards revolving around the relational model against the many flavors provided by NoSQL databases. If you'd like, consider NoSQL databases as an umbrella term for any DBMS that does not completely assume the relational model, in exclusion to everything else. The differences are too many, but that's the principal difference and the one I think would be of most use to you to understand the two.)
Try to explain this question in a level referring to a little bit technology
Take MongoDB and Traditional SQL for comparison, imagine the scenario of posting a Tweet on Twitter. This tweet contains 9 pictures. How do you store this tweet and its corresponding pictures?
In terms of traditional relationship SQL, you can store the tweets and pictures in separate tables, and represent the connection through building a new table.
What's more, you can set a field which is an image type, and zip the 9 pictures into a binary document and store it in this field.
Using MongoDB, you could build a document like this (similar to the concept of a table in relational SQL):
{
"id":"XXX",
"user":"XXX",
"date":"xxxx-xx-xx",
"content":{
"text":"XXXX",
"picture":["p1.png","p2.png","p3.png"]
}
Therefore, in my opinion, the main difference is about how do you store the data and the storage level of the relationships between them.
In this example, the data is the tweet and the pictures. The different mechanism about storage level of relationship between them also play a important role in the difference between both.
I hope this small example helps show the difference between SQL and NoSQL (ACID and BASE).
Here's a link of picture about the goals of NoSQL from the Internet:
http://icamchuwordpress-wordpress.stor.sinaapp.com/uploads/2015/01/dbc795f6f262e9d01fa0ab9b323b2dd1_b.png
The difference between relational and non-relational is exactly that. The relational database architecture provides with constraints objects such as primary keys, foreign keys, etc that allows one to tie two or more tables in a relation. This is good so that we normalize our tables which is to say split information about what the database represents into many different tables, once can keep the integrity of the data.
For example, say you have a series of table that houses information about an employee. You could not delete a record from a table without deleting all the records that pertain to such record from the other tables. In this way you implement data integrity. The non-relational database doesn't provide this constraints constructs that will allow you to implement data integrity.
Unless you don't implement this constraint in the front end application that is utilized to populate the databases' tables, you are implementing a mess that can be compared with the wild west.
First up let me start by saying why we need a database.
We need a database to help organise information in such a manner that we can retrieve that data stored in a efficient manner.
Examples of relational database management systems(SQL):
1)Oracle Database
2)SQLite
3)PostgreSQL
4)MySQL
5)Microsoft SQL Server
6)IBM DB2
Examples of non relational database management systems(NoSQL)
1)MongoDB
2)Cassandra
3)Redis
4)Couchbase
5)HBase
6)DocumentDB
7)Neo4j
Relational databases have normalized data, as in information is stored in tables in forms of rows and columns, and normally when data is in normalized form, it helps to reduce data redundancy, and the data in tables are normally related to each other, so when we want to retrieve the data, we can query the data by using join statements and retrieve data as per our need.This is suited when we want to have more writes, less reads, and not much data involved, also its really easy relatively to update data in tables than in non relational databases. Horizontal scaling not possible, vertical scaling possible to some extent.CAP(Consistency, Availability, Partition Tolerant), and ACID (Atomicity, Consistency, Isolation, Duration)compliance.
Let me show entering data to a relational database using PostgreSQL as an example.
First create a product table as follows:
CREATE TABLE products (
product_no integer,
name text,
price numeric
);
then insert the data
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
Let's look at another different example:
Here in a relational database, we can link the student table and subject table using relationships, via foreign key, subject ID, but in a non relational database no need to have two documents, as no relationships, so we store all the subject details and student details in one document say student document, then data is getting duplicated, which makes updating records troublesome.
In non relational databases, there is no fixed schema, data is not normalized. no relationships between data is created, all data mostly put in one document. Well suited when handling lots of data, and can transfer lots of data at once, best where high amounts of reads and less writes, and less updates, bit difficult to query data, as no fixed schema. Horizontal and vertical scaling is possible.CAP (Consistency, Availability, Partition Tolerant)and BASE (Basically Available, soft state, Eventually consistent)compliance.
Let me show an example to enter data to a non relational database using Mongodb
db.users.insertOne({name: ‘Mary’, age: 28 , occupation: ‘writer’ })
db.users.insertOne({name: ‘Ben’ , age: 21})
Hence you can understand that to the database called db, and there is a collections called users, and document called insertOne to which we add data, and there is no fixed schema as our first record has 3 attributes, and second attribute has 2 attributes only, this is no problem in non relational databases, but this cannot be done in relational databases, as relational databases have a fixed schema.
Let's look at another different example
({Studname: ‘Ash’, Subname: ‘Mathematics’, LecturerName: ‘Mr. Oak’})
Hence we can see in non relational database we can enter both student details and subject details into one document, as no relationships defined in non relational databases, but here this way can lead to data duplication, and hence errors in updating can occur therefore.
Hope this explains everything
In layman terms it's strongly structured vs unstructured, which implies that you have different degrees of adaptability for your DB.
Differences arise in indexation particularly as you need to ensure that a certain reference index can link to a another item -> this a relation. The more strict structure of relational DB comes from this requirement.
To note that NosDB apaprently provides both relational and non relational DBs and a way to query both http://www.alachisoft.com/nosdb/sql-cheat-sheet.html