Preserve data integrity in a database structure with two paths of association - mysql

I have this situation that is as simple as it is annoying.
The requirements are
Every item must have an associated category.
Every item MAY be included in a set.
Sets must be composed of items of the same category.
There may be several sets of the same category.
The desired logic procedure to insert new data is as following:
Categories are inserted.
Items are inserted. For each new item, a category is assigned.
Sets of items of the same category are created.
I'd like to get a design where data integrity between tables is ensured.
I have come up with the following design, but I can't figure out how to maintain data integrity.
If the relationship highlighted in yellow is not taken into account, everything is very simple and data integrity is forced by design: an item acquires a category only when it is assigned to a set and the category is given by the set itself.However, it would not be possible to have items not associated with a set but linked to a category and this is annoying.
I want to avoid using special "bridging sets" to assign a category to an item since it would feel hacky and there is no way to distinguish between real sets and special ones.
So I introduced the relationship in yellow. But now you can create sets of objects of different categories!
How can I avoid this integrity problem using only plain constraints (index, uniques, FK) in MySQL?
Also I would like to avoid triggers as I don't like them as it seems a fragile and not very reliable way to solve this problem...
I've read about similar question like How to preserve data integrity in circular reference database structure? but I cannot understand how to apply the solution in my case...

Interesting scenario. I don't see a slam-dunk 'best' approach. One consideration here is: what proportion of items are in sets vs attached only to categories?
What you don't want is two fields on items. Because, as you say, there's going to be data anomalies: an item's direct category being different to the category it inherits via its set.
Ideally you'd make a single field on items that is an Algebraic Data Type aka Tagged Union, with a tag saying its payload was a category vs a set. But SQL doesn't support ADTs. So any SQL approach would have to be a bit hacky.
Then I suggest the compromise is to make every item a member of a set, from which it inherits its category. Then data access is consistent: always JOIN items-sets-categories.
To support that, create dummy sets whose only purpose is to link to a category.
To address "there is no way to distinguish between real sets and special ones": put an extra field/indicator on sets: this is a 'real' set vs this is a link-to-category set. (Or a hack: make the set-description as "Category: <category-name>".)
Addit: BTW your "desired logic procedure to insert new data" is just wrong: you must insert sets (Step 3) before items (Step 2).

I think I might found a solution by looking at the answer from Roger Wolf to a similar situation here:
Ensuring relationship integrity in a database modelling sets and subsets
Essentially, in the items table, I've changed the set_id FK to a composite FK that references both set.id and set.category_id from, respectively, items.set_id and item.category_id columns.
In this way there is an overlap of the two FKs on items table.
So for each row in items table, once a category_id is chosen, the FK referring to the sets table is forced to point to a set of the same category.
If this condition is not respected, an exception is thrown.
Now, the original answer came with an advice against the use of this approach.
I am uncertain whether this is a good idea or not.
Surely it works and I think that is a fairly elegant solution compared to the one that uses tiggers for such a simple piece of a a more complex design.
Maybe the same solution is more difficult to understand and maintain if heavily applied to a large set of tables.
Edit:
As AntC pointed out in the comments below, this technique, although working, can give insidious problems e.g. if you want to change the category_id for a set.
In that case you would have to update the category_id of each item linked to that set.
That needs BEGIN COMMIT/END COMMIT wrapped around the updates.
So ultimately it's probably not worth it and it's better to investigate the requirements further in order to find a better schema.

Related

Extending a Table with Users' Individual Preferred Custom Properties

What is the most performance-efficient way to allow end-users to add custom properties to a core table used by an application.
For example, core table FRIENDS has columns ID, FIRST_NAME, LAST_NAME, and BIRTHDAY.
User 1 wants to also track additional properties FAVORITE_COLOR and LUCKY_NUMBER, but User 2 wants to also track different additional properties ZODIAC_SIGN, MARRIAGE_ANNIVERSRY_DATE, and GOLF_HANDICAP.
I have implemented two approaches for testing:
First approach: Add a new table FRIENDS_CUSTOM_PROPERTIES having an FK pointer back to FRIENDS and two columns for value pairs (KEY and VALUE such as FAVORITE_COLOR, YELLOW). This approach potentially requires many queries on FRIENDS_CUSTOM_PROPERTIES to retrieve all the properties for a given friend.
Second approach: Add extension columns right on the FRIENDS table itself of varying data types for CUSTOM_1, CUSTOM_2, ... CUSTOM_64, etc. If a user needed more custom properties than there were columns, my design would "spill over" to approach 1. This approach is more brute force but easily results in many NULL column values on many rows.
I can make both work but am unsure the best approach to determine which is better (or if there is already a clear best practice one way or another).
Thanks.
Approach number one is called entity-attribute-value as Rick James noted in the comments. It can do the job, but you sacrifice lots of useful features of SQL, like data types and constraints. See EAV FAIL for some of my writing on this.
You wrote something about running "many queries" but there's no advantage to doing that. You should plan on fetching the set of custom properties for a user in one query, and saving it to a map object in your client application.
The latter approach number two is incomplete. You would also need to store some kind of metadata so you know that for user 1, CUSTOM_1 means "Favorite Color" and CUSTOM_2 means "Lucky Number" and so on. Where do you plan to store the meaning of each column per user?
At least with the EAV design, each attribute comes with a key, so you know exactly what it means. And EAV allows for an unlimited number of properties, because each property gets a new row.
Ultimately, any design that allow for "user-defined properties" conflicts with principles of relational databases. Your columns no longer have any concept of a type. Read a book like SQL and Relation Theory to understand more about this.

MySQL - appropriate application of VIEWS or FOREIGN KEYS

Suppose I have a table that acts as an inventory of my house - inventory_items if you will. inventory_items contains everything I own, but only the most general information (i.e fields that will apply to everything I own, like a name, purchase date).
I then wish to have a separate table for electronics_data which is an inventory item, but has special information to store (lets say serial_number, wattage) and another for furniture_data which contains furniture specific information (number_of_legs, material).
In all instances, items in electronics_data will have a matching item in inventory_items linked by an id field. The same is true of furniture_data.
If I now wish to show a list of my inventory items, but include specific information from the child tables, logically I think to load the inventory_data, find out what type of item this is, and load the right information from the right table. I can think of two better ways:
1) Create a foreign key relationship between inventory_items and electronics_data - thus loading all items will get me all of my child data too. But, not all items in inventory_items will have a matching item in electronics_data so does this mean a foreign key can't work?
2) Create a view which loads the extra tables if a matching item exists in them, and load the view in my application. If I have lots of different 'types' of data, will this make my view unnecessarily slow (checking everything) and actually defeat the object of the view in the first place?
These are general questions - particularly 2) I would imagine is very data dependent.
Thanks!
1) Foreign keys will work, since the specialised tables are the child tables, so you need to make sure that each record in the child table has a corresponding record in the overall inventory_items table. The reverse is not necessarily true.
2) The view can left join the child tables on the inventory_items table. If the fields used in the join are indexed in all tables, then the operation is not that resource intensive. The biggest catch could be how you build the view, if you have lots of specialised child tables. But this is probably a wider application design question anyway (if you are looking at your electronic devices, then you probably do not want to see the fields from the furniture items table - in these specialised views I would use inner join, not left join).
well it will make your life easier if you could join the tables when extracting data. There are a lot of ways to join tables, in your case if all your tables have an I.D column then you could use an 'Equijoin' This is how you could do so
SELECT inventory_items.name, electronics_data.wattage, furniture_data.material
FROM inventory_items, electronics_data, furniture_data
WHERE inventory_items.i.d=electronics_data.i.d=furniture_data.id;
so with a join like this you can add as many columns as you wish but make sure to highlight the table they are from and in the 'WHERE' clause show where they are equal otherwise it wont return any data
I have posted an fairly detailed response to a similar question here, even how to define the views you mention. Note that the code shown in the view definition is for illustration only. It will not show the most efficient way to write it. Better ways should be fairly straight-forward, however.
A word about view performance. Take a view that joins very large tables in such a way that the query
select * from <view>
takes a long time, say 30 minutes. The query
select * from <view> where <criteria>
could take fractions of a second. In most modern DBMSs, the where criteria is merged with the existing query in the view definition to execute the query. It does not execute the view definition and then do the filtering. So test view performance with actual queries not "data dump" queries.

How do I join on the correct one-to-one table (super-type, sub-type model)?

I've been looking into first, second, and third normal forms, and I want to do a better job normalizing my tables. Part of this, I realized, was that I've never understood the purpose of one-to-one tables. From what I understand, "optional" data should be grouped into another table, leaving distinct entities intact, while avoiding the nuances of maintaining several NULL fields in one monolithic table.
So, a real-world scenario. In a CMS, I want to maintain several different types of "pages," making it extendable by additional plugins without affecting the original schema. I have these as sample tables so far:
Pages (title, path, type, etc.)
ContentPages (same as base page, but with keyword/description/content fields)
LinkedPages (same as base page, but contains a reference to another page)
ProductPages (same as base page, but with SKU and other ecomm-related info)
So far, so good. No NULLS. Self-documented design. Super-typing / Sub-typing is consistent between my PHP models and database. Everything's DANDY.
EXCEPT, given any page ID, I don't want to do a first query to get the base page info, figure out what type of page it is, and then get the corresponding sub-type information with another query. Do I have to keep track of this with application state (or URL), or is there a way to know which table to join on, while only knowing the page ID and nothing else?
This is really easy with only one table (obviously), as the NULL fields imply the type, or an ENUM can tell me what it is. Switching back to 1NF isn't an acceptable answer, as I already know how to do it. I want to learn this way ;)
UPDATE: Also wanted to mention that each of the sub-type properties is unique to that type. So, any common property shared by all types will, of course, go into the base page table. Sub-types won't share any other properties. This seemed like a logical way to group the sub-tables, but maybe I'm defeating the purpose of one-to-one tables with this arrangement...
It depends on who's asking the question. If your plugin is driving the query then it can start at its specific subtype and join in the supertype, which it knows must exist.
I don't know what your business requirements are, but it seems to me that if you are trying to keep things modular then you want to drive as many joins from the child side (i.e. the plugin side) as possible.
If you are going to have a query driven from the supertype to the subtype then you can use an outer join and just be ready in your code to handle null columns if the subtype in question isn't present. Obviously that approach is less modular, but I suppose there could be times when that is what you need or want to do.
you could create a view by left outer joining all the subtypes on the main Page table. The view could be queried by a single page_id and would return one row with many null values, the same as you'd get with one big 1st normal form page table.
is there a way to know which table to join on, while only knowing the
page ID and nothing else?
Well, in a supertype/subtype structure, you should know more than the page ID. You should also know the subtype.
Usually, a supertype/subtype structure for 'n' subtypes maps to
n + 1 tables, one for each subtype, plus one for the supertype, and
n updatable views, each of which joins the supertype with the appropriate subtype
So your application should usually be working with the views, not with the base tables. (Usually, but not always.)
If you're not using the views, then when you retrieve the page id numbers from the supertype, you should also retrieve the column that identifies the subtype. Don't have such a column? Fix that. And see this other relevant SO database design problem for a supertype/subtype with code, a description of the structure, and the logic behind it.

Implementing Comments and Likes in database

I'm a software developer. I love to code, but I hate databases... Currently, I'm creating a website on which a user will be allowed to mark an entity as liked (like in FB), tag it and comment.
I get stuck on database tables design for handling this functionality. Solution is trivial, if we can do this only for one type of thing (eg. photos). But I need to enable this for 5 different things (for now, but I also assume that this number can grow, as the whole service grows).
I found some similar questions here, but none of them have a satisfying answer, so I'm asking this question again.
The question is, how to properly, efficiently and elastically design the database, so that it can store comments for different tables, likes for different tables and tags for them. Some design pattern as answer will be best ;)
Detailed description:
I have a table User with some user data, and 3 more tables: Photo with photographs, Articles with articles, Places with places. I want to enable any logged user to:
comment on any of those 3 tables
mark any of them as liked
tag any of them with some tag
I also want to count the number of likes for every element and the number of times that particular tag was used.
1st approach:
a) For tags, I will create a table Tag [TagId, tagName, tagCounter], then I will create many-to-many relationships tables for: Photo_has_tags, Place_has_tag, Article_has_tag.
b) The same counts for comments.
c) I will create a table LikedPhotos [idUser, idPhoto], LikedArticles[idUser, idArticle], LikedPlace [idUser, idPlace]. Number of likes will be calculated by queries (which, I assume is bad). And...
I really don't like this design for the last part, it smells badly for me ;)
2nd approach:
I will create a table ElementType [idType, TypeName == some table name] which will be populated by the administrator (me) with the names of tables that can be liked, commented or tagged. Then I will create tables:
a) LikedElement [idLike, idUser, idElementType, idLikedElement] and the same for Comments and Tags with the proper columns for each. Now, when I want to make a photo liked I will insert:
typeId = SELECT id FROM ElementType WHERE TypeName == 'Photo'
INSERT (user id, typeId, photoId)
and for places:
typeId = SELECT id FROM ElementType WHERE TypeName == 'Place'
INSERT (user id, typeId, placeId)
and so on... I think that the second approach is better, but I also feel like something is missing in this design as well...
At last, I also wonder which the best place to store counter for how many times the element was liked is. I can think of only two ways:
in element (Photo/Article/Place) table
by select count().
I hope that my explanation of the issue is more thorough now.
The most extensible solution is to have just one "base" table (connected to "likes", tags and comments), and "inherit" all other tables from it. Adding a new kind of entity involves just adding a new "inherited" table - it then automatically plugs into the whole like/tag/comment machinery.
Entity-relationship term for this is "category" (see the ERwin Methods Guide, section: "Subtype Relationships"). The category symbol is:
Assuming a user can like multiple entities, a same tag can be used for more than one entity but a comment is entity-specific, your model could look like this:
BTW, there are roughly 3 ways to implement the "ER category":
All types in one table.
All concrete types in separate tables.
All concrete and abstract types in separate tables.
Unless you have very stringent performance requirements, the third approach is probably the best (meaning the physical tables match 1:1 the entities in the diagram above).
Since you "hate" databases, why are you trying to implement one? Instead, solicit help from someone who loves and breathes this stuff.
Otherwise, learn to love your database. A well designed database simplifies programming, engineering the site, and smooths its continuing operation. Even an experienced d/b designer will not have complete and perfect foresight: some schema changes down the road will be needed as usage patterns emerge or requirements change.
If this is a one man project, program the database interface into simple operations using stored procedures: add_user, update_user, add_comment, add_like, upload_photo, list_comments, etc. Do not embed the schema into even one line of code. In this manner, the database schema can be changed without affecting any code: only the stored procedures should know about the schema.
You may have to refactor the schema several times. This is normal. Don't worry about getting it perfect the first time. Just make it functional enough to prototype an initial design. If you have the luxury of time, use it some, and then delete the schema and do it again. It is always better the second time.
This is a general idea
please donĀ“t pay much attention to the field names styling, but more to the relation and structure
This pseudocode will get all the comments of photo with ID 5
SELECT * FROM actions
WHERE actions.id_Stuff = 5
AND actions.typeStuff="photo"
AND actions.typeAction = "comment"
This pseudocode will get all the likes or users who liked photo with ID 5
(you may use count() to just get the amount of likes)
SELECT * FROM actions
WHERE actions.id_Stuff = 5
AND actions.typeStuff="photo"
AND actions.typeAction = "like"
as far as i understand. several tables are required. There is a many to many relation between them.
Table which stores the user data such as name, surname, birth date with a identity field.
Table which stores data types. these types may be photos, shares, links. each type must has a unique table. therefore, there is a relation between their individual tables and this table.
each different data type has its table. for example, status updates, photos, links.
the last table is for many to many relation storing an id, user id, data type and data id.
Look at the access patterns you are going to need. Do any of them seem to made particularly difficult or inefficient my one design choice or the other?
If not favour the one that requires the fewer tables
In this case:
Add Comment: you either pick a particular many/many table or insert into a common table with a known specific identifier for what is being liked, I think client code will be slightly simpler in your second case.
Find comments for item: here it seems using a common table is slightly easier - we just have a single query parameterised by type of entity
Find comments by a person about one kind of thing: simple query in either case
Find all comments by a person about all things: this seems little gnarly either way.
I think your "discriminated" approach, option 2, yields simpler queries in some cases and doesn't seem much worse in the others so I'd go with it.
Consider using table per entity for comments and etc. More tables - better sharding and scaling. It's not a problem to control many similar tables for all frameworks I know.
One day you'll need to optimize reads from such structure. You can easily create agragating tables over base ones and lose a bit on writes.
One big table with dictionary may become uncontrollable one day.
Definitely go with the second approach where you have one table and store the element type for each row, it will give you a lot more flexibility. Basically when something can logically be done with fewer tables it is almost always better to go with fewer tables. One advantage that comes to my mind right now about your particular case, consider you want to delete all liked elements of a certain user, with your first approach you need to issue one query for each element type but with the second approach it can be done with only one query or consider when you want to add a new element type, with the first approach it involves creating a new table for each new type but with the second approach you shouldn't do anything...

MySQL - Best method of saving and loading items

So on my older work, I had always used the 'text' data type to store items, like so:
0=4151:54;1=995:5000;2=521:1;
So basically: slot=item:amount;
I've been looking into finding the best ways of storing information in a sql database, and everywhere i go, it says that using text is a big performance hit.
I was thinking of doing something else, like having a table with the following columns:
id, owner_id, slot_id, item_id, amount
Where as now i can just insert a row for each item a character allocates. But i have no clue how to save them, since the slot's item can change, etc. A character has 28 inventory slots, and 500 bank slots, should i insert them all at registration? or is there a smarter way to save the items
Yes use that structure. Using text to store relational data defeats the purpose of a relational database.
I don't see what you mean by insert them all at registration. Can you not insert them as you need to?
Edit
Based on your previous comment I would recommend only inserting a slot as it is needed (if I understand your problem). It may be an idea to keep the ID of the slot in the application, if need be.
If I understand you correctly, and that the slot's item can change, then you want to further abstract the mapping between item_id and the item:
entry_tbl.item_id->item_rel_realitems_tbl.real_id->items_tbl
This way, all entries with an itemid point to a table that maps those ids to a mutable item. When you UPDATE an item in 'items_tbl' then the mapping automatically updates the entry_tbl.
Another JOIN is needed however. I would also use stored procedures in any case to abstract the mechanism from semantics.
I am not sure I understand the wording of your question however.