Group chats and private chats, separate table or single table with type attribute? - mysql

Currently I have a users table and a chats table; however, I want there to be "Group chats" and "Private chats (DMs)".
A group chat needs more data than a private chat, for example: group name, picture, ....
What is the best way to approach this?
Do I make 1 chats table and put a type attribute in there that determines whether it is private or not, leaving some columns blank if it is a private chat? Or would I make 2 tables, one for private chats and one for group chats?

This is a similar scenario to the general question "should you split sensitive columns into a new table?", and the general answer is the same: it is going to depend largely on your data access code and your security framework.
What about a third option: why not just model a Private Chat as a Group Chat that only has 2 members in the group? Sometimes splitting the model into these types is a premature optimisation, especially in the context of a chat style application. For instance, couldn't a private chat benefit from having an image in the same way that a group chat does? Could there not be some benefit to users being able to specify a group name for their own private group?
You will find the whole development and management of your application a lot simpler if there is just one type of chat and it is up to the user to decide how many people can join or indeed if other people can join the chat.
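As a rough sketch of that third option (table and column names here are assumptions, including the users primary key), every conversation is just a chat with members, and a "private" chat is simply a chat capped at two members:

CREATE TABLE chats (
    chat_id     BIGINT AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NULL,           -- optional, even for a two-person chat
    picture_url VARCHAR(255) NULL,
    max_members INT NOT NULL DEFAULT 2       -- 2 behaves like a DM; raise it for groups
);

CREATE TABLE chat_members (
    chat_id BIGINT NOT NULL,
    user_id BIGINT NOT NULL,
    PRIMARY KEY (chat_id, user_id),
    FOREIGN KEY (chat_id) REFERENCES chats (chat_id),
    FOREIGN KEY (user_id) REFERENCES users (id)  -- assumes users has an id primary key
);

Enforcing the member cap would be application logic (or a trigger), but the schema itself no longer needs to know what a "private" chat is.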
If you still want to explore the 2 conceptual types, here is an answer that might give you some indirect insights: https://stackoverflow.com/a/74398184/1690217 but ultimately we need additional information to justify selecting one structure over the other. Performance, Security and general data governance are some considerations that have implications or impose caveats on implementation.
From a structural point of view, your Group Chats and Private Chats can both be implementations of a common Chat table; conceptually we could say that both forms inherit from Chat.
In relational databases we have 3 general options to model inheritance:
Table Per Hierarchy (TPH)
Use a single table with a discriminator column that determines for each row what the specific type is. Then in your application layer or via views you can query the specific fields that each type and scenario needs.
In TPH the base type is usually an abstract type definition.
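A minimal TPH sketch for this chat case (table and column names are assumptions); the view shows how application code can query only the group-specific fields:

CREATE TABLE chats (
    chat_id       BIGINT AUTO_INCREMENT PRIMARY KEY,
    chat_type     ENUM('private', 'group') NOT NULL,  -- discriminator column
    group_name    VARCHAR(100) NULL,                  -- only meaningful when chat_type = 'group'
    group_picture VARCHAR(255) NULL
);

CREATE VIEW group_chats AS
SELECT chat_id, group_name, group_picture
FROM chats
WHERE chat_type = 'group';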
Table Per Type (TPT)
The base type and each concrete type exist as their own separate tables. The FK from the inheriting tables is also the PK and shares the same PK value as the corresponding record in the base table, creating a 1:0-1 relationship. This requires slightly more complicated data access logic, but it makes it harder to accidentally retrieve a Private Chat in a Group Chat context because the data needs to be queried explicitly from the correct table.
In TPT the base type is itself a concrete type and data records do not have to inherit into the extended types at all.
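A minimal TPT sketch along those lines (column names are assumptions); group_chats shares its primary key value with the corresponding chats row:

CREATE TABLE chats (
    chat_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL
);

CREATE TABLE group_chats (
    chat_id       BIGINT PRIMARY KEY,                -- same value as chats.chat_id (1:0-1)
    group_name    VARCHAR(100) NOT NULL,
    group_picture VARCHAR(255) NULL,
    FOREIGN KEY (chat_id) REFERENCES chats (chat_id)
);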
Simple Isolated Tables (No inheritance in the schema)
This is often the simplest approach: if your tables do have inheritance in the application logic then the common properties would be replicated in each table. This can result in a lot of redundant data access logic, but OO inheritance in the application layer following the DRY principle solves most of the code redundancy issues.
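For completeness, a sketch of the no-inheritance option (column names assumed), where the shared columns are simply repeated in each table:

CREATE TABLE private_chats (
    private_chat_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    created_at      DATETIME NOT NULL
);

CREATE TABLE group_chats (
    group_chat_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    created_at    DATETIME NOT NULL,                 -- shared column, duplicated here
    group_name    VARCHAR(100) NOT NULL,
    group_picture VARCHAR(255) NULL
);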
This answer to How can you represent inheritance in a database? covers DB inheritance from a more academic and researched point of view.
From a performance point of view, there are benefits to isolating workloads if the usage pattern is significantly different. So if Group Chats have a different usage profile, either the frequency or type of queries is significantly different, or the additional fields in Group Chat would benefit from their own index profiles, then splitting the tables will allow your database engine to provide better index management and execution plan optimisations due to more accurate capture of table statistics.
From a security and compliance point of view, a single table implementation (TPH) can reduce the data access logic and therefore the overall attack surface of the code, but a good ORM or code generation strategy usually mitigates any issues that might be raised in this space. Conversely, TPT or simple tables make it easier to define database or schema level security policies and constraints.
Ultimately, which solution is best for you will come down to the effort required to implement and maintain the application logic for your choice.
I will sometimes use a mix of TPT and TPH in the same database but often lean towards TPT if I need inheritance within the data schema; this old post explains my reasoning against TPH: Database Design: Discriminator vs Separate Tables with regard to Constraints. My general rule is that if the type needs to be polymorphic, either to be considered of both types or for the type context to somehow change dynamically in the application runtime, then TPT or no inheritance is simpler to implement.
I use TPH when the differences between the types are minimal and not expected to reasonably diverge too much over the application lifetime, but also when the usage and implementations are going to be very similar.
TPT provides a way to express inheritance but also maintain a branch of vastly different behaviours or interactions (on top of the base implementation). Many TPT implementations look as if they might as well have been separate tables; the desire to constrain the 1:1 link between the records is often a strong decider when choosing this architectural pattern. A good way to think about this model, even if you do not use inheritance at the application logic level, is that you can extend the base record to include the metadata and behaviours of any of the inheriting types. In fact, with TPT it is hard to constrain the data records such that you cannot extend into multiple types.
Due to this limitation, TPT can often be modelled from the application layer as not using OO inheritance at all.
TPT complements Composition over Inheritance.
TPH is often the default way to model a Domain Model that implements simple inheritance, but this introduces a problem in application logic if you need to change the type, and it is incompatible with the idea that a single record could be both types. There are simple workarounds for this, but historically this causes issues from a code maintenance point of view; it's a clash of concepts really: TPH aligns with Inheritance more than Composition.
In the context of Chat, TPT can work from a Composition point of view. All chats have the same basic features and interactions, but Group Chat records can have extended metadata and behaviours. Unless you envision Private Chat having a lot of its own specific implementation there is not really a reason to extend the base concept of Chat to a Private Chat implementation if there is no difference in that implementation.
For that reason too, though, is there a need to differentiate between Private and Group chats at all from a database perspective? Your application runtime shouldn't be using blind SELECT * style queries to access the data in either case; it should be requesting the specific fields that it needs for the given context. Whether you use a field in the table or the name of the table to discriminate between the different concepts is less important than being able to justify the existence of, or the difference between, those concepts.

Related

Is using a Master Table for shared columns good practice for an entire database?

Below, I explain a basic design for a database I am working on. As I am not a DBA, I am concerned whether I am on a good track or a bad one, so I wanted to float this on Stack for some advice. I was not able to find a similar discussion that fits my design.
In my database, every table is considered an entity. An Entity could be a customer account, a person, a user, a set of employee information, contractor information, a truck, a plane, a product, a support ticket, etc etc. Here are my current entities (Tables)...
People
Users
Accounts
AccountUsers
Addresses
Employee Information
Contractor Information
And to store information about these Entities I have two tables:
Entity Tables

EntityType
    EntityTypeID (INT)

Entities
    EntityID (BIGINT)
    EntityType (INT) : foreign key
Every table I have made has an Auto Generated primary key, and a foreign key on an entityID column to the entities table.
In the entities table I have some shared fields like:
DateCreated
DateModified
User_Created
User_Modified
IsDeleted
CanUIDelete
I use triggers on all of the tables to automatically create their entity entry with the correct entity type on inserts, and update triggers update the LastModified date.
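For illustration, a sketch of the kind of trigger being described, assuming the People table carries an EntityID column and that Entities.EntityID is AUTO_INCREMENT (both assumptions):

DELIMITER //
CREATE TRIGGER people_before_insert
BEFORE INSERT ON People
FOR EACH ROW
BEGIN
    -- create the matching Entities row, then link it to the new People row
    INSERT INTO Entities (EntityType, DateCreated) VALUES (1, NOW());  -- 1 = EntityTypeID for 'Person'
    SET NEW.EntityID = LAST_INSERT_ID();
END//
DELIMITER ;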
From an application layer point of view, all the code has to do is worry about the individual entities (except for the User_Modified/User_Created fields, which it updates by joining on the entityID).
Now, the reason for the entities table is that down the line I plan on having an EAV model, so every entity type can be extended with custom fields. It also serves as a decent place to store metadata about the entities (like the created/modified fields).
I'm just new to DB design, and want a 2nd opinion.
I plan on having an EAV model, so every entity type can be extended with custom fields.
Why? Do all your entities require to be extensible in this way? Probably not -- in most applications there are one or two entities at most that would benefit from this level of flexibility. The other entities actually benefit from the stability and clarity of not changing all the time.
EAV is an example of the Inner-Platform Effect:
The Inner-Platform Effect is a result of designing a system to be so customizable that it ends up becoming a poor replica of the platform it was designed with.
In other words, now it's your responsibility to write application code to do all the things that a proper RDBMS already provides, like constraints and data types. Even something as simple as making a column mandatory like NOT NULL doesn't work in EAV.
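To make that concrete, here is a hedged sketch (hypothetical table names) contrasting a constraint the database enforces for free with the same rule under EAV, where an attribute is just a row and nothing forces it to exist:

-- Conventional table: the rule is declared once and enforced by MySQL.
CREATE TABLE Person (
    PersonID BIGINT AUTO_INCREMENT PRIMARY KEY,
    LastName VARCHAR(100) NOT NULL            -- mandatory
);

-- EAV: "LastName" is just a row in a generic table; a person with no such row
-- is perfectly legal as far as the database is concerned.
CREATE TABLE EntityAttributeValue (
    EntityID  BIGINT NOT NULL,
    AttrName  VARCHAR(64) NOT NULL,
    AttrValue TEXT,                            -- one catch-all type for every attribute
    PRIMARY KEY (EntityID, AttrName)
);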
It's true sometimes a project requires a lot of tables. But you're fooling yourself if you think you have simplified the project by making just two tables. You will still have just as many distinct Entities as you would have had tables, but now it's up to you to keep them from turning into a pile of rubbish.
Before you invest too much time into EAV, read this story about a company that nearly ceased to function because someone tried to make their data repository arbitrarily flexible: Bad CaRMa.
I also wrote more about EAV in a blog post, EAV FAIL, and in a chapter of my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
You haven't really given a design. If you had given a description of the tables, the application-oriented criterion for when a row goes in each of them, and the consequent constraints (keys, FKs, etc.) for the part of your application involving your entities, then you would have given part of a design: that part's straightforward relational design. (Just because you're not implementing it that way doesn't mean you don't need to design properly.) Notice that this must include the application-level state and functionality for "extending with custom fields". But then you also have to give a description of the tables, the criterion for when a row goes in each of them, and the consequent constraints (keys, FKs, etc.) for the part of your implementation that encodes the previous part via EAV, plus the operators for manipulating them; in other words, that part's straightforward relational design, which is the part of your design that is implementing a DBMS. Then you would really have given a design.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.
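A sketch of what "DDL instead of DML" can look like in practice; the table and column names here are made up for illustration:

-- Extending an entity type with a "custom field" by altering the real table,
-- rather than inserting attribute rows into a generic EAV table.
ALTER TABLE ContractorInformation
    ADD COLUMN LicenseNumber VARCHAR(50) NULL;

The application can still record that the column exists in its own metadata tables, but the data itself lives in an ordinary, constrainable column.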

Relational Table Design For Single Object w/Multiple Types

I am creating a database for a web application and am looking for some suggestions to model a single entity that might have multiple types, with each type having differing attributes.
As an example assume that I want to create a relational model for a "Data Source" object. There will be some shared attributes of all data sources, such as a numerical identifier, a name, and a type. Each type will then have differing attributes based on the type. For the sake of argument let's say we have two types, "SFTP" and "S3".
For the S3 type we might have to store the bucket, AWSAccessKeyId, YourSecretAccessKeyID, etc. For SFTP we would have to store the address, username, password, potentially a key of some sort.
My first inclination would be to break out each type into their own table with any non-common fields being represented in that new table with a foreign key in the main "Data Source" table. What I don't like about that is that I would then have to know which table is associated with each type that is stored in the main table and rewrite the queries coming from the web app dynamically based on that type.
Is there a simple solution or best practices I'm missing here?
What you are describing is a situation where you want to implement table inheritance. There are three methods for doing this, all described in Martin Fowler's excellent book, Patterns of Enterprise Application Architecture.
What you describe as your first inclination is called Class Table Inheritance by Fowler. It is the method that I tend to use in my database designs, but doesn't always fit well. This method corresponds most closely to an OO view of the database, with a table representing an abstract class and other tables representing concrete implementations of the abstract class. Data must be queried and updated from multiple tables.
It sounds like what you actually want to use is called Single Table Inheritance by Fowler. In this method, you'd actually put columns for all of your data in one table, with a discriminator column to identify which fields are associated with the element type. Queries are generally simpler, although you do have to deal with the discriminator column.
Finally, the third type is called Concrete Table Inheritance by Fowler. In my mind, this is the least useful. In this method, you give up all concepts of having any kind of hierarchical data, and create a single table for each element type. Still, there are times when this might work for you.
All three methods have their pros and cons. You should consult Fowler's descriptions of each pattern to see which might work best for you in your project.
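As a rough illustration of the first option (Class Table Inheritance) for the data-source example, with all column names assumed:

CREATE TABLE DataSource (
    DataSourceID INT AUTO_INCREMENT PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    SourceType   VARCHAR(10)  NOT NULL           -- 'S3' or 'SFTP'
);

CREATE TABLE DataSourceS3 (
    DataSourceID    INT PRIMARY KEY,             -- shared primary key with DataSource
    Bucket          VARCHAR(255) NOT NULL,
    AccessKeyId     VARCHAR(128) NOT NULL,
    SecretAccessKey VARCHAR(128) NOT NULL,
    FOREIGN KEY (DataSourceID) REFERENCES DataSource (DataSourceID)
);

CREATE TABLE DataSourceSFTP (
    DataSourceID INT PRIMARY KEY,
    Address      VARCHAR(255) NOT NULL,
    Username     VARCHAR(100) NOT NULL,
    Password     VARCHAR(255) NULL,
    SshKey       TEXT NULL,
    FOREIGN KEY (DataSourceID) REFERENCES DataSource (DataSourceID)
);

With Single Table Inheritance the same columns would instead collapse into one DataSource table, with most of the type-specific columns nullable.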

Implementing inheritance in MySQL: alternatives and a table with only surrogate keys

This is a question that has probably been asked before, but I'm having some difficulty finding exactly my case, so I'll explain my situation in search of some feedback:
I have an application that will be registering locations. I have several types of locations, and each location type has a different set of attributes, but I need to associate notes to locations regardless of their type, and also other types of content (mostly multimedia entries and comments) to said notes. With this in mind, I came up with a couple of solutions:
Create a table for each location type, and a "notes" table for every location table with a foreign key. This is pretty troublesome because I would have to create a multimedia and comments table for every notes table, e.g.:
LocationTypeA
    ID
    Attr1
    Attr2

LocationTypeA_Notes
    ID
    Attr1
    ...
    LocationTypeA_fk

LocationTypeA_Notes_Multimedia
    ID
    Attr1
    ...
    LocationTypeA_Notes_fk
And so on. This would be quite annoying to do, but after it's done, developing on this structure should not be so troublesome.
Create a table with a unique identifier for the location and point content there, like so:
Location
    ID

LocationTypeA
    ID
    Attr1
    Attr2
    Location_fk

Notes
    ID
    Attr1
    ...
    Location_fk

Multimedia
    ID
    Attr1
    ...
    Notes_fk
As you see, this is far simpler and also easier to develop, but I just don't like the looks of that table with only IDs (yeah, that's truly the only objection I have to this; it's the option I like the most, to be honest).
Similar to option 2, but I would have an enormous table of attributes shaped like this:
Location
    ID
    Type

Attribute
    Name
    Value
And so on, or a table for each attribute, a la Drupal. This would be a pain to develop because it would take several insert/update operations to do something on a location, and the Attribute table would be several times bigger than the Location table (or I would end up with an enormous number of attribute tables); it also has the same issue as the surrogate-keys-only table (except it has a "type" now, which I would use to define the behavior of the location programmatically), but it's a pretty solution.
So, to the question: which would be a better solution performance- and scalability-wise? Which would you go with, or which alternatives would you propose? I don't have a problem implementing any of these; options 2 and 3 would be an interesting development, I've never done something like that, but I don't want to go with an option that will collapse on itself when the content grows a bit. You're probably thinking "why not just use Drupal if you know it works like you expect it to?", and I'm thinking "you obviously don't know how difficult it is to use Drupal, either that or you're an expert, which I'm most definitely not".
Also, now that I've written all of this, do you think option 2 is a good idea overall? Do you know of a better way to group entities / simulate inheritance? (Please don't say "just use inheritance!", I'm restricted to using MySQL.)
Thanks for your feedback, I'm sorry if I wrote too much and meant too little.
ORM systems usually use one of the following strategies, which are mostly the same solutions you listed:
One table per hierarchy
Pros:
Simple approach.
Easy to add new classes, you just need to add new columns for the additional data.
Supports polymorphism by simply changing the type of the row.
Data access is fast because the data is in one table.
Ad-hoc reporting is very easy because all of the data is found in one table.
Cons:
Coupling within the class hierarchy is increased because all classes are directly coupled to the same table.
A change in one class can affect the table which can then affect the other classes in the hierarchy.
Space potentially wasted in the database.
Indicating the type becomes complex when significant overlap between types exists.
Table can grow quickly for large hierarchies.
When to use:
This is a good strategy for simple and/or shallow class hierarchies where there is little or no overlap between the types within the hierarchy.
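A compact sketch of this strategy for the location example in the question (all names assumed):

-- One table per hierarchy: a single locations table with a discriminator and
-- nullable type-specific columns; notes point at it regardless of type.
CREATE TABLE locations (
    location_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
    location_type VARCHAR(20) NOT NULL,     -- discriminator, e.g. 'TypeA', 'TypeB'
    type_a_attr1  VARCHAR(100) NULL,        -- only used by TypeA rows
    type_b_attr1  VARCHAR(100) NULL         -- only used by TypeB rows
);

CREATE TABLE notes (
    note_id     BIGINT AUTO_INCREMENT PRIMARY KEY,
    location_id BIGINT NOT NULL,
    body        TEXT,
    FOREIGN KEY (location_id) REFERENCES locations (location_id)
);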
One table per concrete class
Pros:
Easy to do ad-hoc reporting as all the data you need about a single class is stored in only one table.
Good performance to access a single object’s data.
Cons:
When you modify a class you need to modify its table and the table of any of its subclasses. For example if you were to add height and weight to the Person class you would need to add columns to the Customer, Employee, and Executive tables.
Whenever an object changes its role, perhaps you hire one of your customers, you need to copy the data into the appropriate table and assign it a new POID value (or perhaps you could reuse the existing POID value).
It is difficult to support multiple roles and still maintain data integrity. For example, where would you store the name of someone who is both a customer and an employee?
When to use:
When changing types and/or overlap between types is rare.
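A sketch of this strategy for the same example (names assumed); each concrete type is a standalone table and the shared columns are repeated:

CREATE TABLE location_type_a (
    location_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,      -- shared column, repeated per table
    attr1       VARCHAR(100) NULL,
    attr2       VARCHAR(100) NULL
);

CREATE TABLE location_type_b (
    location_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,      -- repeated again here
    attr3       VARCHAR(100) NULL
);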
One table per class
Pros:
Easy to understand because of the one-to-one mapping.
Supports polymorphism very well as you merely have records in the appropriate tables for each type.
Very easy to modify superclasses and add new subclasses as you merely need to modify/add one table.
Data size grows in direct proportion to growth in the number of objects.
Cons:
There are many tables in the database, one for every class (plus tables to maintain relationships).
Potentially takes longer to read and write data using this technique because you need to access multiple tables. This problem can be alleviated if you organize your database intelligently by putting each table within a class hierarchy on different physical disk-drive platters (this assumes that the disk-drive heads all operate independently).
Ad-hoc reporting on your database is difficult, unless you add views to simulate the desired tables.
When to use:
When there is significant overlap between types or when changing types is common.
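A sketch of this strategy (names assumed): a base table plus one table per subtype joined on a shared primary key, so reading a TypeA location takes a join:

CREATE TABLE locations (
    location_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE location_type_a (
    location_id BIGINT PRIMARY KEY,          -- same value as locations.location_id
    attr1       VARCHAR(100) NULL,
    attr2       VARCHAR(100) NULL,
    FOREIGN KEY (location_id) REFERENCES locations (location_id)
);

SELECT l.location_id, l.name, a.attr1, a.attr2
FROM locations l
JOIN location_type_a a ON a.location_id = l.location_id;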
Generic Schema
Pros:
Works very well when database access is encapsulated by a robust persistence framework.
It can be extended to provide meta data to support a wide range of mappings, including relationship mappings. In short, it is the start at a mapping meta data engine.
It is incredibly flexible, enabling you to quickly change the way that you store objects because you merely need to update the meta data stored in the Class, Inheritance, Attribute, and AttributeType tables accordingly.
Cons:
Very advanced technique that can be difficult to implement at first.
It only works for small amounts of data because you need to access many database rows to build a single object.
You will likely want to build a small administration application to maintain the meta data.
Reporting against this data can be very difficult due to the need to access several rows to obtain the data for a single object.
When to use:
For complex applications that work with small amounts of data, or for applications where your data access isn't very common or you can pre-load data into caches.
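A minimal sketch of a generic schema for the location case (names assumed), along the lines of the questioner's option 3: attribute definitions live in one table and values in another.

CREATE TABLE attribute_types (
    attribute_type_id INT AUTO_INCREMENT PRIMARY KEY,
    name              VARCHAR(64) NOT NULL,  -- e.g. 'opening_hours'
    value_type        VARCHAR(20) NOT NULL   -- e.g. 'string', 'int', 'date'
);

CREATE TABLE location_attributes (
    location_id       BIGINT NOT NULL,       -- assumes a locations table holding ID and Type
    attribute_type_id INT NOT NULL,
    value             TEXT,                  -- everything stored as text
    PRIMARY KEY (location_id, attribute_type_id),
    FOREIGN KEY (attribute_type_id) REFERENCES attribute_types (attribute_type_id)
);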

(Somewhat) complicated database structure vs. simple — with null fields

I'm currently choosing between two different database designs: one complicated, which separates the data better, and a simpler one. The more complicated design will require more complex queries, while the simpler one will have a couple of null fields.
Consider the examples below:
Complicated: (diagram: a users table with separate facebookusers and normalusers tables linked to it)
Simpler: (diagram: a single users table with nullable facebookUserId, username and password columns)
The above examples are for separating regular users and Facebook users (they will access the same data, eventually, but log in differently). In the first example, the data is clearly separated. The second example is way simpler, but will have at least one null field per row: facebookUserId will be null if it's a normal user, while username and password will be null if it's a Facebook user.
My question is: what's preferred? Pros/cons? Which one is easiest to maintain over time?
First, what Kirk said. It's a good summary of the likely consequences of each alternative design. Second, it's worth knowing what others have done with the same problem.
The case you outline is known in ER modeling circles as "ER specialization". ER specialization is just different wording for the concept of subclasses. The diagrams you present are two different ways of implementing subclasses in SQL tables. The first goes under the name "Class Table Inheritance". The second goes under the name "Single Table Inheritance".
If you do go with Class table inheritance, you will want to apply yet another technique, that goes under the name "shared primary key". In this technique, the id fields of facebookusers and normalusers will be copies of the id field from users. This has several advantages. It enforces the one-to-one nature of the relationship. It saves an extra foreign key in the subclass tables. It automatically provides the index needed to make the joins run faster. And it allows a simple easy join to put specialized data and generalized data together.
You can look up "ER specialization", "single-table-inheritance", "class-table-inheritance", and "shared-primary-key" as tags here in SO. Or you can search for the same topics out on the web. The first thing you will learn is what Kirk has summarized so well. Beyond that, you'll learn how to use each of the techniques.
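A brief sketch of the shared-primary-key technique applied to the tables in the question (column types assumed):

CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE normalusers (
    id       BIGINT PRIMARY KEY,             -- copy of users.id
    username VARCHAR(100) NOT NULL,
    password VARCHAR(255) NOT NULL,
    FOREIGN KEY (id) REFERENCES users (id)
);

CREATE TABLE facebookusers (
    id             BIGINT PRIMARY KEY,       -- copy of users.id
    facebookUserId BIGINT NOT NULL,
    FOREIGN KEY (id) REFERENCES users (id)
);

-- The "simple easy join" putting specialized and generalized data together:
SELECT u.id, f.facebookUserId
FROM users u
JOIN facebookusers f ON f.id = u.id;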
Great question.
This applies to any abstraction you might choose to implement, whether in code or database. Would you write a separate class for the Facebook user and the 'normal' user, or would you handle the two cases in a single class?
The first option is the more complicated. Why is it complicated? Because it's more extensible. You could easily include additional authentication methods (a table for Twitter IDs, for example), or extend the Facebook table to include... some other facebook specific information. You have extracted the information specific to each authentication method into its own table, allowing each to stand alone. This is great!
The trade off is that it will take more effort to query, it will take more effort to select and insert, and it's likely to be messier. You don't want a dozen tables for a dozen different authentication methods. And you don't really want two tables for two authentication methods unless you're getting some benefit from it. Are you going to need this flexibility? Authentication methods are all similar - they'll have a username and password. This abstraction lets you store more method-specific information, but does that information exist?
The second option is just the reverse of the first. Easier, but how will you handle future authentication methods, and what if you need to add some authentication-method-specific information?
Personally I'd try to evaluate how important this authentication component is to the system. Remember YAGNI - you aren't gonna need it - and don't overdesign. Unless you need that extensibility that the first option provides, go with the second. You can always extract it at a later date if necessary.
This depends on the database you are using. For example, Postgres has table inheritance that would be great for your example; have a look here:
http://www.postgresql.org/docs/9.1/static/tutorial-inheritance.html
Now if you do not have table inheritance you could still create views to simplify your queries, so the "complicated" example is a viable choice here.
Now, if you have infinite time, then I would go for the first one (for this one simple example, and preferably with table inheritance).
However, this makes things more complicated and so will cost you more time to implement and maintain. If you have many table hierarchies like this it can also have a performance impact (as you have to join many tables). I once developed a database schema that made excessive use of such hierarchies (conceptually). We finally decided to keep the hierarchies conceptually but flatten them in the implementation, as it had gotten so complex that it was not maintainable anymore.
When you flatten the hierarchy you might consider not using null values, as this can also prove to make things a lot harder (alternatively you can use a -1 or something).
Hope these thoughts help you!
Warning bells are ringing loudly with the presence of the two very similar tables facebookusers and normalusers. What if you get a 3rd type? Or a 10th? This is insane.
There should be one user table with an attribute column to show the type of user. A user is a user.
Keep the data model as simple as you possibly can. Don't build too much kung fu into it via the data structure. Leave that for the application, which is far easier to alter than a database!
Let me dare to suggest a third option. You could introduce 1 (or 2) tables that will cater for extensibility. I personally try to avoid designs that will introduce (read: pollute) an entity model with non-uniformly applicable columns. Have the third table (after the fashion of the EAV model) contain a many-to-one relationship with your users table to cater for multiple/variable user-related fields.
I'm not sure what your current/short-term needs are, but re-engineering your app to cater for, say, Twitter or LinkedIn users might be painful. You can abstract the content of the facebookUserId column into an attribute table like so:
user_attr {
    id       PK
    user_id  FK
    login_id
}
Now, the above definition is ambiguous enough to handle your current needs. If done right, the EAV should look more like this:
user_attr {
    id               PK
    user_id          FK
    login_id
    login_id_type    FK
    login_id_status  // simple boolean flag to set the validity of a given login
}
Where login_id_type will be a foreign key to an attribute table listing the various login types you currently support. This gives you and your users flexibility, in that your users can have multiple logins using different external services without you having to change much of your existing system.
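One possible MySQL rendering of that sketch; the table and column names come from the answer above, while the column types and the users(id) reference are assumptions:

CREATE TABLE login_types (
    login_type_id INT AUTO_INCREMENT PRIMARY KEY,
    name          VARCHAR(50) NOT NULL            -- 'local', 'facebook', 'twitter', ...
);

CREATE TABLE user_attr (
    id              BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id         BIGINT NOT NULL,
    login_id        VARCHAR(255) NOT NULL,
    login_id_type   INT NOT NULL,
    login_id_status TINYINT(1) NOT NULL DEFAULT 1, -- validity flag from the sketch
    FOREIGN KEY (user_id) REFERENCES users (id),
    FOREIGN KEY (login_id_type) REFERENCES login_types (login_type_id)
);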

Single table or separate table for each user to hold similar records? (performance??)

I have 2 scenarios for a MySQL DB and I'm not sure which to choose, and I've run into the same dilemma for a few tables.
I'm making a web application only accessed by members. Each member has their own deals, expenses, and say "listings". The criteria for the records is the same across users, but each user can have completely different amounts of records.
My 2 scenarios are whether I should have one table for deals, one table for listings, one table for expenses... and have a field in each that links to the primary key for a particular user. Or... if it is better to have a separate deal table, expense table, and listing table for each user (using a combined string like "user"+deals, or "user"+exp). Deals can be used across 1 or 2 users, but expenses and listings are completely independent. I am going to have a master deal table to hold all the info for each deal, but there is a user deal table(s) that links their primary key to a deal primary key.
So, separate tables or one table? If there are thousands of users with hundreds of deals/expenses/listings, I just don't want the queries to be extremely slow after a lot of deals or expenses have built up... No user will ever need to view anything from other users; strictly just their data.
Also, I'm familiar with how a database works and stores data, but I'm not 100% clear. I just want it to work quickly, so my other question is (although it may be stupid): when a user submits a new deal or expense, is it inserted at the beginning or the end of the table? Or is it irrelevant, because a query will search everything in the table either way before returning information?
Always use one table to store one kind of entity.
Or more specifically, what you're talking about is a nasty, complicated optimisation that works in an incredibly small subset of cases which almost certainly isn't yours.
You want to use just one table for one kind of entry. Index it appropriately, and try to get rid of old records when you don't need them any more.
Also, a lot of people's idea of "big data" isn't actually particularly big. Databases normally need little optimisation while their data still fit in RAM, which on a modern system means, say, 32 GB.
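A minimal sketch of that single-table approach (column names assumed): one expenses table for all members, indexed on the user column so each member's rows stay cheap to fetch:

CREATE TABLE expenses (
    expense_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    created_at DATETIME NOT NULL,
    INDEX idx_expenses_user (user_id),
    FOREIGN KEY (user_id) REFERENCES users (id)  -- assumes a users table with an id primary key
);

-- A member only ever sees their own rows:
SELECT expense_id, amount, created_at
FROM expenses
WHERE user_id = 42;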
Regarding your second question:
In MySQL (with the InnoDB engine) the order of the records on disk is defined by your PRIMARY KEY, meaning a record does not get inserted at the end or the beginning, but rather wherever it belongs based on the primary key.
In other DBs you have the option to use CLUSTERED KEYS in order to use a key other than the PRIMARY to order the records on disk, but this is not supported in MySQL to my knowledge.
Regarding your first question:
I found myself in this position a couple of times, and recently I keep getting back to one blog post (the last of a series; the conclusion is at the bottom):
http://weblogs.asp.net/manavi/archive/2011/01/03/inheritance-mapping-strategies-with-entity-framework-code-first-ctp5-part-3-table-per-concrete-type-tpc-and-choosing-strategy-guidelines.aspx
I quote:
Before we get into this discussion, I want to emphasize that there is no one single "best strategy fits all scenarios" exists. As you saw, each of the approaches have their own advantages and drawbacks. Here are some rules of thumb to identify the best strategy in a particular scenario:

If you don't require polymorphic associations or queries, lean toward TPC—in other words, if you never or rarely query for BillingDetails and you have no class that has an association to BillingDetail base class. I recommend TPC (Table per Concrete Type) (only) for the top level of your class hierarchy, where polymorphism isn't usually required, and when modification of the base class in the future is unlikely.

If you do require polymorphic associations or queries, and subclasses declare relatively few properties (particularly if the main difference between subclasses is in their behavior), lean toward TPH (Table per Hierarchy). Your goal is to minimize the number of nullable columns and to convince yourself (and your DBA) that a denormalized schema won't create problems in the long run.

If you do require polymorphic associations or queries, and subclasses declare many properties (subclasses differ mainly by the data they hold), lean toward TPT (Table per Type). Or, depending on the width and depth of your inheritance hierarchy and the possible cost of joins versus unions, use TPC.

By default, choose TPH only for simple problems. For more complex cases (or when you're overruled by a data modeler insisting on the importance of nullability constraints and normalization), you should consider the TPT strategy. But at that point, ask yourself whether it may not be better to remodel inheritance as delegation in the object model (delegation is a way of making composition as powerful for reuse as inheritance). Complex inheritance is often best avoided for all sorts of reasons unrelated to persistence or ORM. EF acts as a buffer between the domain and relational models, but that doesn't mean you can ignore persistence concerns when designing your classes.