MySQL denormalization of multiple entities in one table?

I am building an eCommerce multichannel listing tool for eBay/Amazon/Sears/Rakuten ... and more.
Each entity has its own properties. For example, eBay has ebayItemId/Title/price while Amazon has something like asinNumber/Title/LowestPrice.
My question is: should I have each one in its own table, or should I mix the entities together in one table, where the same column can hold different data depending on the marketplace and a lot of columns might have NULL values?
Do you think this is a good approach, or is it better to normalize them into multiple entities?
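To illustrate the single-table option, I have something like this in mind (these columns are just examples, not a finished schema):

    -- One wide table for every marketplace; columns that don't apply to a
    -- given marketplace are simply left NULL.
    CREATE TABLE listings (
        id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        marketplace   ENUM('ebay','amazon','sears','rakuten') NOT NULL,
        title         VARCHAR(255) NOT NULL,
        -- eBay-specific columns
        ebay_item_id  VARCHAR(40)   NULL,
        price         DECIMAL(10,2) NULL,
        -- Amazon-specific columns
        asin_number   VARCHAR(20)   NULL,
        lowest_price  DECIMAL(10,2) NULL
    );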

The way to evaluate what type of denormalization you should do is to start with the queries you need to answer, then organize the data to help the queries.
You can't find the best table structure without taking the queries into consideration.
For example solutions to your use case, see my answer at https://stackoverflow.com/a/695860/20860

It's best to have a fully normalised schema. Everything is simpler and consistent.
You only denormalise for "performance", which is a different need from the benefits that normalisation gives. So it's best to denormalise via a view, a special table for that purpose, another NoSQL database, etc.
Make your correct, normalised database the source of truth.
Populate/derive your denormalised data from the source of truth and use it for high-speed, read-only operations. How you wire up the two is an implementation detail; there are many options depending on exactly how you implement the design.
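As a rough sketch of that wiring in MySQL (table and column names here are invented for illustration; a summary table refreshed from the source tables would be the other common option), the denormalised read model can simply be a view over the normalised source of truth:

    -- Source of truth: normalised tables, one per marketplace entity.
    CREATE TABLE ebay_listing (
        ebay_item_id  VARCHAR(40)   PRIMARY KEY,
        title         VARCHAR(255)  NOT NULL,
        price         DECIMAL(10,2) NOT NULL
    );

    CREATE TABLE amazon_listing (
        asin_number   VARCHAR(20)   PRIMARY KEY,
        title         VARCHAR(255)  NOT NULL,
        lowest_price  DECIMAL(10,2) NOT NULL
    );

    -- Denormalised, read-only model derived from the source of truth:
    -- one "all listings" view for cross-marketplace reads.
    CREATE VIEW all_listings AS
        SELECT 'ebay'   AS marketplace, ebay_item_id AS external_id, title, price        AS price
        FROM ebay_listing
        UNION ALL
        SELECT 'amazon' AS marketplace, asin_number  AS external_id, title, lowest_price AS price
        FROM amazon_listing;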


Best practices for transforming a relational database to a non relational?

I have a MySQL Database and I need to create a Mongo Database (I don't care about keeping any data).
So are there any good practices for designing the structure (mongoose.Schema) based on the relational tables of MySQL?
For example, the SQL database has a table users and a table courses with a 1:n relation. Should I also create two collections in MongoDB, or would it be better to create a new field courses: [] inside the user document and create only the user collection?
The schema definition should be driven by the use cases of the application.
Under which conditions is data accessed and modified? Which is the leading entity?
E.g. when a user is loaded, do you always also want to know the user's courses? That would be an argument for embedding.
Can you update a course without knowing all of its users, e.g. update the name of a course? Do you want to list an overview of all courses? That would be an argument for extracting courses into their own collection.
So there is no general guideline for such a migration, because the use cases cannot be derived from the schema definition alone.
If you don't care about data, the best approach is to redesign it from scratch.
NoSQL databases differ from RDBMSs in many ways, so a direct mapping will hardly be efficient and in many cases is not possible at all.
The first thing you need to answer for yourself (and probably mention in the question) is why you need to change databases in the first place. There are different kinds of problems that Mongo can solve better than SQL, and they require different data models. None of them come for free, so you will need to understand the trade-offs.
You can start from a very simple rule: in SQL you model your data after your business objects and describe the relations between them; in Mongo you model your data after the queries you need to answer. As soon as you grasp the idea, it will let you ask answerable questions.
It may be worth reading https://www.mongodb.com/blog/post/building-with-patterns-a-summary as a starting point.
An older yet still quite useful read is https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1. Just keep in mind it was written a long time ago, when Mongo did not have many of the v4+ features. Nevertheless, it describes the philosophy of Mongo data modelling with simple examples, and that philosophy hasn't changed much since then.

DB table organization by entity, or vertically by level of data?

I hope the title is clear, please read further and I will explain what I mean.
We are having a disagreement with our database designer about high-level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex - there are dozens of different types of entities (representing a variety of real-world entities; you could think of them as product developers, factories, products, inspections, certifications, etc.), each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top-level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records). This works - we have a basic view in PHP which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the size of the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that the size of the DB means we should optimize this over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question based on the comments I got and the further research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and with all the data in one of them: it's an Entity-Attribute-Value (EAV) database. These are characterized by:
A very unstructured approach; each fact or data point is just dumped into a big table along with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it out. An EAV is optimized for adding data and for organizational flexibility, and the price is that it's slower to access and harder to write queries for, etc.
A "long and skinny" format: lots of rows, very few columns.
Because the data is "self encoded" with its own characteristics, it is often used in situations where you know there will be lots of possible characteristics or data points but most of them will be empty ("sparse data"). A table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points. (A stripped-down sketch of this layout follows below.)
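For example, a two-table version of this layout in MySQL (the actual DB described above splits it across four tables; all names here are invented for illustration) might look like:

    -- One table of entities and one generic table of attribute values:
    -- every fact is an (entity, attribute, value) row instead of a column.
    CREATE TABLE entity (
        id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        entity_type VARCHAR(50) NOT NULL   -- e.g. 'factory', 'product', 'inspection'
    );

    CREATE TABLE entity_value (
        entity_id BIGINT UNSIGNED NOT NULL,
        attribute VARCHAR(100)    NOT NULL,  -- e.g. 'certification_date'
        value     VARCHAR(255),              -- everything stored as text
        PRIMARY KEY (entity_id, attribute),
        FOREIGN KEY (entity_id) REFERENCES entity(id)
    );

    -- Adding a brand-new characteristic is just an INSERT, no ALTER TABLE...
    INSERT INTO entity (entity_type) VALUES ('product');
    SET @eid = LAST_INSERT_ID();
    INSERT INTO entity_value VALUES
        (@eid, 'name', 'Widget A'),
        (@eid, 'certification_date', '2014-06-01');

    -- ...but even simple "rows and columns" questions become pivot queries.
    SELECT e.id,
           MAX(CASE WHEN v.attribute = 'name'               THEN v.value END) AS name,
           MAX(CASE WHEN v.attribute = 'certification_date' THEN v.value END) AS certification_date
    FROM entity e
    JOIN entity_value v ON v.entity_id = e.id
    WHERE e.entity_type = 'product'
    GROUP BY e.id;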
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think speed of access will matter much for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly, I think this structure would be hard for us DB noobs to understand and control, so I am leaning towards the traditional model - sacrificing flexibility, and maybe ease of adding new data, in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.

SQL one-to-one relationships vs flattening

I'm using a standard SQL database and I'm trying to figure out whether to flatten a table or make it more "object-oriented". To me, smaller tables are easier to read, but they would require joining tables and having one-to-one relationships. Is this generally a good way of doing things, or is it frowned on in the SQL world?
I have a table which has the following attributes:
MYTABLE
- ID
- NAME
- LABEL
- CREATED_TS
- MODIFIED_TS
- CREATED_USER
- MODIFIED_USER
To me, the created/modified fields would be their own object. There are actually a few more fields as well, so it's not really just this small. I am thinking of creating another table called "MYTABLE_MODINFO" or something like that, which would hold the CREATED and MODIFIED fields; they would be joined in when that data was needed. These tables aren't high-access tables; they wouldn't have tons of queries per minute or even hundreds of rows in them, so I don't think efficiency would be much of an issue.
So mainly what I'm wondering is would this be a generally accepted design or should you generally keep your table structures flat?
You should keep the audit information in the same table. The reason is that this data is part of the row and is a one-to-one relationship, so there is no point in splitting it out.
If you want to store the audit info (audit tracking/history), then you can create another table; however, in most cases I have seen this built by "duplicating" data and creating a surrogate key with a mapping back to the original row. The reason I put "duplicating" in quotes is that auditing inherently requires duplicating the old data... if it is linked and changeable after being written, then it is not really an audit.
Just my two cents. If it does not make sense, then I can provide some examples. But, the gist is that each row will only ever have one current piece of modification information, so why break it out if it will never have more than one?
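A minimal sketch of both pieces, using the column names from the question (the audit table layout is just one common way to do it, not the only one):

    -- Current modification info stays on the row itself (one-to-one).
    CREATE TABLE mytable (
        id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name          VARCHAR(100) NOT NULL,
        label         VARCHAR(100),
        created_ts    DATETIME NOT NULL,
        modified_ts   DATETIME NOT NULL,
        created_user  VARCHAR(50) NOT NULL,
        modified_user VARCHAR(50) NOT NULL
    );

    -- Optional audit history: each change "duplicates" the old values
    -- under its own surrogate key, mapped back to the original row.
    CREATE TABLE mytable_audit (
        audit_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
        mytable_id    BIGINT UNSIGNED NOT NULL,                    -- mapping back to the row
        name          VARCHAR(100),
        label         VARCHAR(100),
        modified_ts   DATETIME NOT NULL,
        modified_user VARCHAR(50) NOT NULL,
        FOREIGN KEY (mytable_id) REFERENCES mytable(id)
    );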
Avoid a database one-to-one; you'll lose performance, scalability and independence. Can you imagine what happens if you want to store 2 pictures per ID? Will you create another field, or will you repeat the row? It's easier to create a relationship so you have more freedom when you want to upgrade. Please review these tutorials:
http://www.youtube.com/watch?v=Onzm-PxSjtE
http://folkworm.ceri.memphis.edu/ew/SCHEMA_DOC/comparison/erd.htm
http://www.visual-paradigm.com/product/vpuml/provides/dbmodeling.jsp
Besides that, you should normalize the DB to be sure that everything is in the best shape possible. Remember that the most important thing is to take what you need and adapt it.
http://databases.about.com/od/specificproducts/a/normalization.htm
http://www.youtube.com/watch?v=xzeuBwHkKxw
RDBMS design isn't the same as an object-oriented approach, in my view. The example you mentioned isn't a separate object domain but audit data inherent to your record. Since there would not be the overhead of tons of queries/executions against the table, you should keep the fields in the same table for auditing purposes; it is also easier to work with when the data is normalized.

Database Design: to EAV or not to EAV?

Say I have an entity that will have many attributes, some I know about now and others will be user defined. What's the best way to model this?
1) Do I have a main table and relate it to a secondary name-value pair table? All the attributes go in the secondary EAV table.
OR -
2) Do I put the most common attributes (not all users will need them, so I expect a lot of NULL entries) in the main table and have the secondary EAV table for the user defined attributes?
OR -
3) Some other approach I have not thought of?
You may use solution two for efficiency reasons, in particular if you often need to select on these quantities. These values can act as a "cache" of the EAV table, if you want. You introduce duplication but speed up lookups.
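As a sketch of option 2 in MySQL (table and column names are invented for illustration): the common attributes are real, indexable columns, and anything user-defined falls through to the EAV table.

    -- Common attributes as real columns (fast to index and filter on),
    -- with NULLs where an entry doesn't use them.
    CREATE TABLE item (
        id     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name   VARCHAR(255) NOT NULL,
        color  VARCHAR(30),        -- common but optional attribute
        weight DECIMAL(10,3)       -- common but optional attribute
    );

    -- User-defined attributes go into the generic name-value table.
    CREATE TABLE item_attribute (
        item_id BIGINT UNSIGNED NOT NULL,
        name    VARCHAR(100)    NOT NULL,  -- user-defined attribute name
        value   VARCHAR(255),
        PRIMARY KEY (item_id, name),
        FOREIGN KEY (item_id) REFERENCES item(id)
    );

If you also copy ("cache") selected EAV values into the main table's columns, something has to keep the two in sync on every write.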
EAV is a good solution for this problem unless you have to perform joins at the DB level. An alternative is to move away from the relational model to an RDF-based model.
Typically, lots of empty cells are cheap and not worth normalizing away. The only drawback to #2 is if you have a very large number of rows (millions, where performance problems could arise), a very large number of columns (more than about 20, where it's just annoying to look at the data), or a number of unique constraints on the EAV table.
With that said, it is now 2011 and it makes sense to use a programming framework with a database abstraction layer these days, so that you're not designing database relationships directly. Something like Django's object-relational mapper allows you to focus on the models themselves and lets best practices take care of themselves (95% of the time). This tutorial will help you get started. Django only applies to web-development database modeling; for non-web environments, other frameworks will be better.
I've done a lot of work with the EAV pattern, and it has served the purpose well enough. I find empty columns, or dynamic columns (like col1, col2, etc.), to be much harder to deal with and manage after the fact, but they can be easier to query since you don't need as many joins.
One thing I would very strongly recommend is taking a look at options like MongoDB. It handles complex dynamic data structures automatically.

How do you know when you need separate tables?

How do you know when to create a new table for very similar object types?
Example:
To learn mysql I'm building a model solar system. For the purposes of my project, planets have many similar attributes to dwarf planets, centaurs, and comets. Dwarf planets are almost completely identical to planets. Centaurs and comets are only different from planets because their orbital path has more variation. Should I have a separate table for each type of object, or should they share tables?
The example is probably too simple, but I'm also interested in best practices. For instance, should I use separate tables just in case I want to make planets and dwarf planets different in the future, or are there any efficiency reasons for keeping them in the same table?
Normal forms are what you should be interested in. They are pretty much the convention for building tables.
Any design that doesn't break the first, second or third normal form is fine by me. That's a pretty long list of requirements though, so I suggest you read up on those normal forms on Wikipedia.
It depends on what type of information you want to store about the objects. If the information for all of them is the same, say orbit radius, mass and name, then you can use the same table. However, if there are different properties for each (say atmosphere composition for planets, etc.) then you can either use separate tables for each (not very normalized) or have one table for basic properties like orbit, mass and name and a second table for just the properties that are unique to planets (and a similar table for comets, etc. if needed). All objects would be in the first table but only planets would be in the second table and linked through a foreign key to the first table.
It's called Database Normalization
There are many normal forms. By applying normalization you will go through the metadata (tables) and study the relationships between data more clearly. By using normalization techniques you will optimize the tables to prevent redundancy. This process will help you understand which entities to create based on the relationships between the different fields.
You should most likely split the data about a planet etc so that the shared (common) information is in another table.
E.g.
Common (table)
- Diameter (column)
- Mass (column)
Planet (table)
- Population (column)
Comet (table)
- Speed (column)
Poor column names, I know. Have the Planet and Comet tables link to the Common data with a key.
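In MySQL that could look roughly like this (the id/foreign-key setup is just one way to wire up the key; only the columns from the example above are included):

    CREATE TABLE common (
        id       BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        diameter DOUBLE,
        mass     DOUBLE
    );

    -- Planet-only and comet-only attributes live in their own tables,
    -- each linked back to the shared row by its key (a 1:1 relationship).
    CREATE TABLE planet (
        common_id  BIGINT UNSIGNED PRIMARY KEY,
        population BIGINT,
        FOREIGN KEY (common_id) REFERENCES common(id)
    );

    CREATE TABLE comet (
        common_id BIGINT UNSIGNED PRIMARY KEY,
        speed     DOUBLE,
        FOREIGN KEY (common_id) REFERENCES common(id)
    );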
This is definitely a subjective question. It sounds like you are already on the right lines of thinking. I would ask:
Do these objects share many attributes? If so, it's probably worth considering at the very least a base table to list them all in.
Does one object "extend" another - it has all the attributes of the other, plus some extras? If so, it might be worth adding another table with the extra attributes and a one-to-one mapping back to the base object.
Do both objects have many shared attributes and unshared attributes? If this is the case, maybe you need a single table plus a "data extension" system where each object can have a type or category that specifies any amount of extra attributes that may be associated with it.
Do the objects only share one or two attributes? In this case, they are probably dissimilar enough to separate into multiple tables.
You may also ask yourself how you are going to query the data. Will you ever want to get them all in the same list? It's always a good idea to combine data into tables with other data they will commonly be queried with. For example, an "attachments" table where the file can be an image or a video, instead of images and video tables, if you commonly want to query for all attachments. Don't split into multiple tables unless there is a really good reason.
If you will ever want to get planets and comets in one single query, they will pretty much have to be in the same table if you want the database to work efficiently. Inheritance should be handled inside your app itself :)
Here's my answer to a similar question, which I think applies here as well:
How do you store business activities in a SQL database?
There are many different ways to express inheritance in your relational model. For example, you can try to squish everything into one table and have a field that lets you distinguish between the different types, or have one table for the shared attributes with relationships to child tables holding the specific attributes, etc. With either choice you're still storing the same information. When going from a domain model to a relational model, this is what is called an impedance mismatch. Both choices have different trade-offs; for example, one table will be easier to query, but multiple tables will have higher data density.
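For instance, the "squish everything into one table" variant might be sketched like this (the discriminator column and names are illustrative, reusing the planet/comet example from the question):

    -- Single-table approach: one row per object, a discriminator column to
    -- tell the types apart, and NULLs where an attribute doesn't apply.
    CREATE TABLE heavenly_body (
        id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        body_type  ENUM('planet','comet') NOT NULL,  -- distinguishes the types
        diameter   DOUBLE,
        mass       DOUBLE,
        population BIGINT,  -- planets only, NULL for comets
        speed      DOUBLE   -- comets only, NULL for planets
    );

    -- Easy to query across all types in one go...
    SELECT body_type, COUNT(*) FROM heavenly_body GROUP BY body_type;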
In my experience it's best not to try to answer these questions from a database perspective, but let your domain model, and sometimes your application framework of choice, drive the table structure. Of course this isn't always a viable choice, especially when performance is concerned.
I recommend you start by drawing on paper the relationships you want to express and then go from there. Does the table structure you've chosen represent the domain accurately? Is it possible to query to extract the information you want to report on? Are the queries you've written complicated or slow? Answering these questions and others like them will hopefully guide you towards creating a good relational model.
I'd also suggest reading up on database normalization if you're serious about learning good relational modeling principles.
I'd probably have a table called [HeavenlyBodies] or some such thing. Then have a lookup table with the type of body, i.e. planet, comet, asteroid, star, etc. All will share similar things such as name, size and weight. Most of the answers I read so far have good advice. Normalization is good, but I feel you can take it too far sometimes. Third normal form is a good goal.