Database Design: to EAV or not to EAV? - mysql

Say I have an entity that will have many attributes, some I know about now and others will be user defined. What's the best way to model this?
1) Do I have a main table and relate it to a secondary name-value pair table? All the attributes go in the secondary EAV table.
OR -
2) Do I put the most common attributes (not all users will need them, so I expect a lot of NULL entries) in the main table and have the secondary EAV table for the user defined attributes?
OR -
3) Some other approach I have not thought of?

You may use solution two for efficiency reasons, in particular if you need to select on these common attributes often. Those columns act as a "cache" of the EAV table, if you like: you introduce duplication but speed up lookups.
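As a rough sketch of what option 2 can look like (the table and column names here are invented purely for illustration):

-- Common, frequently queried attributes live on the main table;
-- everything user-defined goes into the EAV side table.
CREATE TABLE item (
    item_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(255) NOT NULL,
    price      DECIMAL(10,2) NULL,   -- common attribute, NULL when unused
    created_at DATETIME NOT NULL
);

CREATE TABLE item_attribute (
    item_id    INT UNSIGNED NOT NULL,
    attr_name  VARCHAR(100) NOT NULL,
    attr_value VARCHAR(255) NULL,
    PRIMARY KEY (item_id, attr_name),
    FOREIGN KEY (item_id) REFERENCES item(item_id)
);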
EAV is a good solution for this problem unless you have to perform joins at the db level. An alternative is to move away from the relational model and move to an RDF-based model.

Typically, lots of empty cells are cheap and not worth normalizing away. The only drawbacks to #2 are if you have a very large number of rows (millions, where performance problems could arise), a very large number of columns (more than about 20, where it's just annoying to look at the data), or a number of unique constraints on the EAV table.
With that said, it is now 2011 and it makes sense to use a programming framework with a database abstraction layer these days, so that you're not designing database relationships directly. Something like Django's object-relational mapper allows you to focus on the models themselves and lets best practices take care of themselves (95% of the time). This tutorial will help you get started. Django only applies to web-development database modeling; for non-web environments, other frameworks will be better.

I've done a lot of work with the EAV pattern, and it has served the purpose well enough. I find empty columns, or dynamic columns (like col1, col2, etc.), to be much harder to deal with and manage after the fact, but they can be easier to query since you don't need as many joins.
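To illustrate why the joins add up with EAV, here is a hedged sketch (assuming a generic entity table plus an EAV table called entity_attribute; the names are invented): every attribute you want back as a column needs its own join or conditional aggregate.

SELECT e.entity_id,
       colr.attr_value AS color,
       size.attr_value AS size
FROM entity e
LEFT JOIN entity_attribute colr
       ON colr.entity_id = e.entity_id AND colr.attr_name = 'color'
LEFT JOIN entity_attribute size
       ON size.entity_id = e.entity_id AND size.attr_name = 'size';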
One thing I would very strongly recommend is taking a look at options like MongoDB. It handles complex, dynamic data structures out of the box.

Related

mysql denormalization of multiple entities in one table?

I am building an eCommerce multichannel listing tool for eBay/Amazon/Sears/Rakuten ... and more.
Each entity has its own properties. For example, eBay has ebayItemId/Title/price while Amazon has something like asinNumber/Title/LowestPrice.
My question is: should I have each one in its own table, or should I mix the entities together in one table, where a column can store different data depending on the marketplace and a lot of columns might have NULL values?
Do you think this is a good approach, or is it better to normalize them into multiple entities?
The way to evaluate what type of denormalization you should do is to start with the queries you need to answer, then organize the data to help the queries.
You can't find the best table structure without taking the queries into consideration.
For example solutions to your use case, see my answer at https://stackoverflow.com/a/695860/20860
It's best to have a fully normalised schema. Everything is simpler and consistent.
You only denormalise for "performance", which is a different need from the benefits that normalisation gives. So it's best to denormalise via a view, a special table built for that purpose, another NoSQL database, etc.
Make your correct normalised database the source of truth.
Populate/derive your denormalised data from the source of truth and use it for high-speed, read-only operations. How you wire up the two is an implementation detail; there are many options depending on exactly how you implement the design.
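As one possible way to wire up the two (purely a sketch; the listing/ebay_listing/amazon_listing names are invented), the denormalised read copy can simply be a view over the normalised tables:

-- The normalised tables stay the source of truth; reads hit the flat view.
CREATE VIEW listing_flat AS
SELECT l.listing_id,
       l.title,
       e.ebay_item_id,
       a.asin_number,
       a.lowest_price
FROM listing l
LEFT JOIN ebay_listing   e ON e.listing_id = l.listing_id
LEFT JOIN amazon_listing a ON a.listing_id = l.listing_id;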

DB table organization by entity, or vertically by level of data?

I hope the title is clear, please read further and I will explain what I mean.
We are having a disagreement with our database designer about the high-level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex: there are dozens of different types of entities (representing a variety of real-world entities; you could think of them as product developers, factories, products, inspections, certifications, etc.), each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records.) This works - we have a basic view in php which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the size of the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that the size of the DB means we should optimize this over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question based on the comments I got and the further research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and with all the data in one of them: it's an Entity-Attribute-Value (EAV) database. These are characterized by:
A very unstructured approach: each fact or data point is just dumped into a big table along with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it out. An EAV is optimized for adding data and for organizational flexibility, and the price is that it's slower to access and harder to write queries for, etc.
A "long and skinny" format, lots of rows, very few columns.
Because the data is "self-encoded" with its own characteristics, it is often used in situations where you know there will be lots of possible characteristics or data points but most of them will be empty ("sparse data"). A traditional table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points.
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think that speed of access will be that important for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly, I think this structure would be hard for us DB noobs to understand and control, so I am leaning towards the traditional model: sacrificing flexibility, and maybe ease of adding new data, in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.

Single table or separate table for each user to hold similar records? (performance??)

I have 2 scenarios for a MySQL DB and I'm not sure which to choose, and I've run into the same dilemma for a few tables.
I'm making a web application only accessed by members. Each member has their own deals, expenses, and say "listings". The criteria for the records is the same across users, but each user can have completely different amounts of records.
My 2 scenarios are whether I should have one table for deals, one table for listings, and one table for expenses, each with a field that links to the primary key of a particular user. Or whether it is better to have a separate deal table, expense table, and listing table for each user (using a combined name like "user"+deals or "user"+exp). Deals can be shared between 1 or 2 users, but expenses and listings are completely independent. I am going to have a master deal table to hold all the info for each deal, but there is a user-deal table that links each user's primary key to a deal primary key.
So, separate tables or one table? If there are thousands of users with hundreds of deals/expenses/listings..I just don't want the queries to be extremely slow after a lot of deals or expenses have built up...No user will ever need to view anything from other users...strictly just their data.
Also, I'm familiar with how a database works and stores data, but I'm not 100% clear on it. I just want it to work quickly, so my other question is (although it may be stupid): when a user submits a new deal or expense, is it inserted at the beginning or the end of the table? Or is it irrelevant, because a query will search everything in the table either way before returning information?
Always use one table to store one kind of entity.
Or more specifically, what you're talking about is a nasty, complicated optimisation that works in an incredibly small subset of cases which almost certainly isn't yours.
You want to use just one table for each kind of entity. Index it appropriately, and try to get rid of old records when you don't need them any more.
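A minimal sketch of that layout (the column names are made up): one shared table per entity type, with the owning user's id on every row and an index on it.

CREATE TABLE expenses (
    expense_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    amount      DECIMAL(10,2) NOT NULL,
    incurred_on DATE NOT NULL,
    INDEX idx_expenses_user (user_id)
);

-- "Just their data" is then one cheap, indexed lookup:
SELECT * FROM expenses WHERE user_id = 42;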
Also, a lot of people's idea of "big data" isn't actually particularly big. Databases normally need little optimisation while their data still fits in RAM, which on a modern system means, say, 32 GB.
Regarding your second question:
In MySQL (with the InnoDB engine), the order of the records on disk is defined by your PRIMARY KEY, meaning a record does not get inserted at the end or the beginning, but rather wherever it belongs based on the primary key.
In other databases you have the option to use CLUSTERED KEYS to order the records on disk by a key other than the PRIMARY KEY, but this is not supported in MySQL to my knowledge.
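If that ordering matters for your use case, one sketch (not a requirement, and the names are invented): since InnoDB stores rows in primary-key order, a composite primary key that starts with the user id keeps each member's rows physically close together on disk.

CREATE TABLE listings (
    user_id    INT UNSIGNED NOT NULL,
    listing_id INT UNSIGNED NOT NULL,
    title      VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, listing_id)  -- rows cluster by user first
) ENGINE=InnoDB;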
Regarding your first question:
I found myself in this position a couple of times, and recently I keep coming back to one blog post (the last of a series; the conclusion is at the bottom):
http://weblogs.asp.net/manavi/archive/2011/01/03/inheritance-mapping-strategies-with-entity-framework-code-first-ctp5-part-3-table-per-concrete-type-tpc-and-choosing-strategy-guidelines.aspx
I quote:
Before we get into this discussion, I want to emphasize that there is no one single "best strategy fits all scenarios". As you saw, each of the approaches has its own advantages and drawbacks. Here are some rules of thumb to identify the best strategy in a particular scenario:

If you don't require polymorphic associations or queries, lean toward TPC—in other words, if you never or rarely query for BillingDetails and you have no class that has an association to the BillingDetail base class. I recommend TPC (Table per Concrete Type) (only) for the top level of your class hierarchy, where polymorphism isn't usually required, and when modification of the base class in the future is unlikely.

If you do require polymorphic associations or queries, and subclasses declare relatively few properties (particularly if the main difference between subclasses is in their behavior), lean toward TPH (Table per Hierarchy). Your goal is to minimize the number of nullable columns and to convince yourself (and your DBA) that a denormalized schema won't create problems in the long run.

If you do require polymorphic associations or queries, and subclasses declare many properties (subclasses differ mainly by the data they hold), lean toward TPT (Table per Type). Or, depending on the width and depth of your inheritance hierarchy and the possible cost of joins versus unions, use TPC.

By default, choose TPH only for simple problems. For more complex cases (or when you're overruled by a data modeler insisting on the importance of nullability constraints and normalization), you should consider the TPT strategy. But at that point, ask yourself whether it may not be better to remodel inheritance as delegation in the object model (delegation is a way of making composition as powerful for reuse as inheritance). Complex inheritance is often best avoided for all sorts of reasons unrelated to persistence or ORM. EF acts as a buffer between the domain and relational models, but that doesn't mean you can ignore persistence concerns when designing your classes.
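To make the vocabulary concrete, here is a rough sketch of what TPH and TPT can look like at the table level, reusing the BillingDetail example from the quote (all column names are invented):

-- TPH (table per hierarchy): one table, a discriminator column, nullable subtype columns.
CREATE TABLE billing_detail (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    detail_type  VARCHAR(20) NOT NULL,    -- 'CREDIT_CARD' or 'BANK_ACCOUNT'
    owner        VARCHAR(100) NOT NULL,
    card_number  VARCHAR(20) NULL,        -- credit cards only
    bank_account VARCHAR(34) NULL         -- bank accounts only
);

-- TPT (table per type): a shared base table plus one table per subclass, joined by id.
CREATE TABLE billing_detail_base (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    owner VARCHAR(100) NOT NULL
);
CREATE TABLE credit_card (
    id          INT UNSIGNED PRIMARY KEY,
    card_number VARCHAR(20) NOT NULL,
    FOREIGN KEY (id) REFERENCES billing_detail_base(id)
);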

What is the difference between a Relational and Non-Relational Database?

MySQL, PostgreSQL and MS SQL Server are relational database systems, and NoSQL, MongoDB, etc. are non-relational DBMSs.
What are the differences between the two types of system?
Hmm, not quite sure what your question is.
In the title you ask about Databases (DB), whereas in the body of your text you ask about Database Management Systems (DBMS). The two are completely different and require different answers.
A DBMS is a tool that allows you to access a DB.
Other than the data itself, a DB is the concept of how that data is structured.
So just like you can program with an object-oriented methodology using a non-OO compiler, or vice versa, you can set up a relational database without an RDBMS or use an RDBMS to store non-relational data.
I'll focus on what Relational Database (RDB) means and leave the discussion about what systems do to others.
A relational database (the concept) is a data structure that allows you to link information from different 'tables', or different types of data buckets. A data bucket must contain what is called a key or index that allows you to uniquely identify any atomic chunk of data within the bucket. Other data buckets may refer to that key so as to create a link between their data atoms and the atom pointed to by the key.
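A minimal sketch of that idea (table names invented): one bucket's key is referenced by another bucket to create the link.

CREATE TABLE author (
    author_id INT PRIMARY KEY,          -- the key that uniquely identifies each atom
    name      VARCHAR(100) NOT NULL
);
CREATE TABLE book (
    book_id   INT PRIMARY KEY,
    title     VARCHAR(200) NOT NULL,
    author_id INT NOT NULL,             -- refers to the author bucket's key
    FOREIGN KEY (author_id) REFERENCES author(author_id)
);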
A non-relational database just stores data without explicit and structured mechanisms to link data from different buckets to one another.
As for implementing such a scheme: if you have a paper file with an index, and in a different paper file you refer to that index to get at the relevant information, then you have implemented a relational database, albeit quite a simple one. So you see that you do not even need a computer (of course it can become tedious very quickly without one to help), and similarly you do not need an RDBMS, though arguably an RDBMS is the right tool for the job. That said, there are variations in what the different tools out there can do, so choosing the right tool for the job may not be all that straightforward.
I hope this is in layman's terms enough and helps your understanding.
Relational databases have a mathematical basis (set theory, relational theory), which is distilled into SQL == Structured Query Language.
NoSQL's many forms (e.g. document-based, graph-based, object-based, key-value store, etc.) may or may not be based on a single underpinning mathematical theory. As S. Lott has correctly pointed out, hierarchical data stores do indeed have a mathematical basis. The same might be said for graph databases.
I'm not aware of a universal query language for NoSQL databases.
Most of what you "know" is wrong.
First of all, as a few of the relational gurus routinely (and sometimes stridently) point out, SQL doesn't really fit nearly as closely with relational theory as many people think. Second, most of the differences in "NoSQL" stuff has relatively little to do with whether it's relational or not. Finally, it's pretty difficult to say how "NoSQL" differs from SQL because both represent a pretty wide range of possibilities.
The one major difference that you can count on is that almost anything that supports SQL supports things like triggers in the database itself -- i.e. you can design rules into the database proper that are intended to ensure that the data is always internally consistent. For example, you can set things up so your database asserts that a person must have an address. If you do so, anytime you add a person, it will basically force you to associate that person with some address. You might add a new address or you might associate them with some existing address, but one way or another, the person must have an address. Likewise, if you delete an address, it'll force you to either remove all the people currently at that address, or associate each with some other address. You can do the same for other relationships, such as saying every person must have a mother, every office must have a phone number, etc.
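For illustration only, a sketch of the person-must-have-an-address rule being enforced by the database itself (the table and column names are invented; a NOT NULL foreign key does the work):

CREATE TABLE address (
    address_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    street     VARCHAR(255) NOT NULL
);
CREATE TABLE person (
    person_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    address_id INT UNSIGNED NOT NULL,   -- NOT NULL: every person needs an address
    FOREIGN KEY (address_id) REFERENCES address(address_id)
    -- with the default RESTRICT behaviour, deleting an address fails while
    -- any person still references it
);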
Note that these sorts of things are also guaranteed to happen atomically, so if somebody else looks at the database as you're adding the person, they'll either not see the person at all, or else they'll see the person with the address (or the mother, etc.)
Most of the NoSQL databases do not attempt to provide this kind of enforcement in the database proper. It's up to you, in the code that uses the database, to enforce any relationships necessary for your data. In most cases, it's also possible to see data that's only partially correct, so even if you have a family tree where every person is supposed to be associated with parents, there can be times that whatever constraints you've imposed won't really be enforced. Some will let you do that at will. Others guarantee that it only happens temporarily, though exactly how long it can/will last can be open to question.
The relational database uses a formal system of predicates to address data. The underlying physical implementation is of no substance and can vary to optimize for certain operations, but it must always assume the relational model. In layman's terms, that's just saying: I know exactly how many values (attributes) each row (tuple) in my table (relation) has, and now I want to exploit that fact accordingly, thoroughly and to its extreme. That's the true nature of the beast.
Since we're obviously the generation that has had a relational upbringing, if you look at NoSQL database models from the perspective of the relational model, again in layman's terms, the first obvious difference is that no assumptions about the number of values a row can contain are ever made. This is really oversimplifying the matter and does not cleanly apply to the intricacies of the physical models of every NoSQL database, but it's the pinnacle of the relational model and the first assumption we have to leave behind or, if you'd rather, the biggest leap we have to make.
We can agree on two things that are true for every DBMS: it can store any kind of data, and it has enough mathematical underpinnings to make it possible to manage the data in any way imaginable. The reality is that you never want to make the mistake of putting either of those two points to the test, but rather just stick with what the actual DBMS was really made for. In layman's terms: respect the beast within!
(Please note that I've avoided comparing the (obviously) well founded standards revolving around the relational model against the many flavors provided by NoSQL databases. If you'd like, consider NoSQL databases as an umbrella term for any DBMS that does not completely assume the relational model, in exclusion to everything else. The differences are too many, but that's the principal difference and the one I think would be of most use to you to understand the two.)
Let me try to explain this question at a slightly more technical level.
Take MongoDB and traditional SQL for comparison, and imagine the scenario of posting a tweet on Twitter. The tweet contains 9 pictures. How do you store this tweet and its corresponding pictures?
In a traditional relational SQL database, you can store the tweets and pictures in separate tables and represent the connection by building a new table.
Alternatively, you can define a field of an image type and zip the 9 pictures into a single binary document to store in that field.
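A rough sketch of the first option (separate tables linked through a connecting table; all names are invented for illustration):

CREATE TABLE tweet (
    tweet_id BIGINT UNSIGNED PRIMARY KEY,
    user_id  BIGINT UNSIGNED NOT NULL,
    body     VARCHAR(280) NOT NULL
);
CREATE TABLE picture (
    picture_id BIGINT UNSIGNED PRIMARY KEY,
    url        VARCHAR(255) NOT NULL
);
CREATE TABLE tweet_picture (             -- the connecting table
    tweet_id   BIGINT UNSIGNED NOT NULL,
    picture_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (tweet_id, picture_id),
    FOREIGN KEY (tweet_id)   REFERENCES tweet(tweet_id),
    FOREIGN KEY (picture_id) REFERENCES picture(picture_id)
);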
Using MongoDB, you could build a document like this (a collection is roughly the counterpart of a table in relational SQL, and a document the counterpart of a row):
{
    "id": "XXX",
    "user": "XXX",
    "date": "xxxx-xx-xx",
    "content": {
        "text": "XXXX",
        "picture": ["p1.png", "p2.png", "p3.png"]
    }
}
Therefore, in my opinion, the main difference is how you store the data and at what level the relationships between the data are stored.
In this example, the data is the tweet and the pictures; the different mechanisms for storing the relationship between them also play an important role in the difference between the two.
I hope this small example helps show the difference between SQL and NoSQL (ACID and BASE).
Here's a link to a picture about the goals of NoSQL from the Internet:
http://icamchuwordpress-wordpress.stor.sinaapp.com/uploads/2015/01/dbc795f6f262e9d01fa0ab9b323b2dd1_b.png
The difference between relational and non-relational is exactly that. A relational database architecture provides constraint objects such as primary keys, foreign keys, etc. that allow you to tie two or more tables together in a relation. This is good because it lets us normalize our tables, which is to say split the information the database represents across many different tables, while keeping the integrity of the data.
For example, say you have a series of tables that house information about an employee. You cannot delete a record from one table without deleting all the records that refer to it in the other tables. In this way you enforce data integrity. A non-relational database doesn't provide these constraint constructs.
Unless you implement these constraints in the front-end application that populates the database's tables, you end up with a mess that can be compared to the Wild West.
First up, let me start by saying why we need a database.
We need a database to help organise information in such a way that we can retrieve the stored data efficiently.
Examples of relational database management systems (SQL):
1) Oracle Database
2) SQLite
3) PostgreSQL
4) MySQL
5) Microsoft SQL Server
6) IBM DB2
Examples of non-relational database management systems (NoSQL):
1) MongoDB
2) Cassandra
3) Redis
4) Couchbase
5) HBase
6) DocumentDB
7) Neo4j
Relational databases hold normalized data: information is stored in tables in the form of rows and columns. When data is in normalized form it helps to reduce data redundancy, and the data in the tables is normally related, so when we want to retrieve it we can query it using join statements and get back exactly what we need. This is suited to workloads with more writes and fewer reads, and not too much data involved; it is also relatively easy to update data in tables compared with non-relational databases. Horizontal scaling is not really possible, while vertical scaling is possible to some extent. CAP (Consistency, Availability, Partition tolerance) and ACID (Atomicity, Consistency, Isolation, Durability) apply.
Let me show how to enter data into a relational database, using PostgreSQL as an example.
First create a product table as follows:
CREATE TABLE products (
    product_no integer,
    name       text,
    price      numeric
);
then insert the data
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
Let's look at another different example:
Here, in a relational database, we can link the student table and the subject table through a relationship, via a foreign key (subject ID). In a non-relational database there are no relationships, so there is no need for two documents: we store all the subject details and student details in one document, say the student document. The data then gets duplicated, which makes updating records troublesome.
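A hedged sketch of that link (names invented to mirror the example):

CREATE TABLE subject (
    subject_id    INT UNSIGNED PRIMARY KEY,
    subject_name  VARCHAR(100) NOT NULL,
    lecturer_name VARCHAR(100) NOT NULL
);
CREATE TABLE student (
    student_id   INT UNSIGNED PRIMARY KEY,
    student_name VARCHAR(100) NOT NULL,
    subject_id   INT UNSIGNED NOT NULL,   -- the foreign key doing the linking
    FOREIGN KEY (subject_id) REFERENCES subject(subject_id)
);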
In non-relational databases there is no fixed schema and data is not normalized; no relationships between data are created, and all the data is mostly put in one document. They are well suited to handling lots of data and can transfer lots of data at once; they are best where there are high amounts of reads and fewer writes and updates, and it is a bit more difficult to query data, as there is no fixed schema. Horizontal and vertical scaling are possible. CAP (Consistency, Availability, Partition tolerance) and BASE (Basically Available, Soft state, Eventually consistent) apply.
Let me show an example of entering data into a non-relational database, using MongoDB:
db.users.insertOne({name: 'Mary', age: 28, occupation: 'writer'})
db.users.insertOne({name: 'Ben', age: 21})
Hence you can understand that db is the database, users is a collection, and insertOne is the method with which we add a document. There is no fixed schema: our first record has 3 attributes while the second record has only 2. This is no problem in non-relational databases, but it cannot be done in relational databases, as relational databases have a fixed schema.
Let's look at another different example
({Studname: 'Ash', Subname: 'Mathematics', LecturerName: 'Mr. Oak'})
Hence we can see that in a non-relational database we can enter both the student details and the subject details into one document, since no relationships are defined; but this can lead to data duplication, and hence to errors when updating.
Hope this explains everything
In layman's terms, it's strongly structured vs. unstructured, which implies that you have different degrees of adaptability for your DB.
Differences arise particularly in indexation, as you need to ensure that a certain reference index can link to another item -> this is a relation. The stricter structure of a relational DB comes from this requirement.
Note that NosDB apparently provides both relational and non-relational DBs and a way to query both: http://www.alachisoft.com/nosdb/sql-cheat-sheet.html

MySQL / Rails Performance: One table, many rows vs. many tables, less rows?

In my Rails app I have several models dealing with assets (attachments, pictures, logos, etc.). I'm using attachment_fu, and so far I have 3 different tables for storing the information in my MySQL DB.
I'm wondering if it makes a difference to performance if I use STI and put all the information in just 1 table, using a type column and having different, inherited classes. It would be more DRY and easier to maintain, because they all share many attributes and characteristics.
But what's faster: many tables with fewer rows per table, or just one table with many rows? Or is there no difference at all? I'll have to deal with a lot of information and many queries per second.
Thanks for your opinion!
Many tables and fewer rows is probably faster.
That's not why you should do it, though: your database ought to model your Problem Domain. One table is a poor model of many entity types. So you'll end up writing lots and lots of code to find the subset of that table that represents the entity type you're currently concerned with.
Regular, accepted, clean database and front-end client code won't work, because of your one-table-that-is-all-things-and-no-thing-at-all.
It's slower, more fragile, will multiply your code all over your app, and makes a poor model.
Do this only if all the things have exactly the same attributes and the same (or possibly Liskov substitutable) semantic meaning in your problem domain.
Otherwise, just don't even try to do this.
Or if you do, ask why this is any better than having one big map/hash table/associative array to hold all entities in your app (and lots of functions, most of them duplicated, cut-and-pasted, and out of date, doing switch cases or RTTI to figure out the real type of each entity).
The only way to know for sure is to try both approaches and measure the performance.
In general terms, it depends on whether you're doing joins across those tables and, if so, how the tables are indexed. Generally speaking, database joins are expensive, which is why database schemas are sometimes denormalized to improve performance. This doesn't usually matter until you're dealing with a serious amount of data, i.e. millions of records. You probably don't have that problem yet and maybe never will.
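If you do try the single-table (STI) variant, here is a sketch of what it could look like; Rails STI keys off a type column, and the columns you filter on should be indexed (everything other than the type column is an invented example):

CREATE TABLE assets (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    type       VARCHAR(40)  NOT NULL,   -- 'Attachment', 'Picture', 'Logo' (the STI column)
    owner_id   INT UNSIGNED NOT NULL,
    filename   VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL,
    INDEX idx_assets_type_owner (type, owner_id)
);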
If the rows share the same attributes then, yes, one table is much better, with just one column to specify the type of data; otherwise, use different tables, which is better for performance, the amount of code, and even the readability of the code.