What is the best way to handle these MySQL database relationships?

I'm building a small website that lets users recommend their favourite books to each other. So I have two tables, books and groups. A user can have 0 or more books in their library, and a book belongs to 1 or more groups. Currently, my tables look like this:
books table
|---------|------------|---------------|
| book_id | book_title | book_owner_id |
|---------|------------|---------------|
| 22      | something  | 12            |
|---------|------------|---------------|
| 23      | something2 | 12            |
|---------|------------|---------------|
groups table
|----------|------------|---------------|---------|
| group_id | group_name | book_owner_id | book_id |
|----------|------------|---------------|---------|
| 231      | random     | 12            | 22      |
|----------|------------|---------------|---------|
| 231      | random     | 12            | 23      |
|----------|------------|---------------|---------|
As you can see, the relationships between users+books and books+groups are defined in the tables themselves. Should I define the relationships in their own tables instead? Something like this:
books table
|---------|------------|
| book_id | book_title |
|---------|------------|
| 22      | something  |
|---------|------------|
| 23      | something2 |
|---------|------------|
books_users_relationsship table
|--------|---------|---------|
| rel_id | user_id | book_id |
|--------|---------|---------|
| 1      | 12      | 22      |
|--------|---------|---------|
| 2      | 12      | 23      |
|--------|---------|---------|
groups table
|----------|------------|
| group_id | group_name |
|----------|------------|
| 231      | random     |
|----------|------------|
groups_books_relationsship table
|----------|---------|
| group_id | book_id |
|----------|---------|
| 231      | 22      |
|----------|---------|
| 231      | 23      |
|----------|---------|
Thanks for your time.

The second form, with four tables, is the correct one. You could drop rel_id from books_users_relationsship, since the primary key can be a composite of user_id and book_id, just as in the groups_books_relationsship table.
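As a hedged sketch of that second form in MySQL DDL (column types here are assumptions, and the question does not show a users table, so that foreign key is only hinted at in a comment):

CREATE TABLE books (
    book_id    INT PRIMARY KEY,
    book_title VARCHAR(255) NOT NULL
);

CREATE TABLE `groups` (            -- backquoted: GROUPS is a reserved word in MySQL 8.0
    group_id   INT PRIMARY KEY,
    group_name VARCHAR(255) NOT NULL
);

CREATE TABLE books_users_relationsship (
    user_id INT NOT NULL,
    book_id INT NOT NULL,
    PRIMARY KEY (user_id, book_id),              -- composite key replaces rel_id
    FOREIGN KEY (book_id) REFERENCES books (book_id)
    -- plus FOREIGN KEY (user_id) REFERENCES users (user_id), assuming a users table exists
);

CREATE TABLE groups_books_relationsship (
    group_id INT NOT NULL,
    book_id  INT NOT NULL,
    PRIMARY KEY (group_id, book_id),
    FOREIGN KEY (group_id) REFERENCES `groups` (group_id),
    FOREIGN KEY (book_id)  REFERENCES books (book_id)
);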

You do not need a "relationship table" to support a relationship. In Databases, implementing a Foreign Key in a child table defines the Relation between the parent and the child. You need tables only if they contain data, or to resolve a many-to-many relationship (and that has no data other than the Primary Keys of the parents).
The second problem you are facing, the reason the Relations become complex, and even optional, is due to the first two tables not being Normalised. Many problems ensue from that.
If you look closely at books, you may notice that the same book (title) gets repeated.
Likewise, there is no differentiation between (a) a book in terms of its existence in the world and (b) a copy of a book, which is owned by a member and available for borrowing.
E.g. a review is about the book that exists in the world, written once, and applies to all copies of that book; not to an owned copy.
Your "relationship" tables also have data in them, and that data is repeated.
All this repeated data needs to be maintained and kept in sync.
All those problems are eliminated if the data is Normalised.
Therefore (since you are seeking the "best way"), the sequence is to normalise the data first, after which (no surprise) the Relations are easy and not complex, and no data is repeated (in either the tables or the relations).
When Normalising, it is best to model the real world (not the entire real world, but whatever parts of it you are implementing in the database). That insulates your database from the effects of change, and functional extensions to it in future do not require the existing tables to be changed.
It is also important to use accurate names for tables and columns, for the same reason. group is non-specific and will cause a problem in the future when you implement some other form of grouping.
The Relations can now be defined at the correct "level", between the correct tables.
The need to stick an Id column on everything that moves severely hinders your ability to understand the data and thus the Normalisation process, and robs the database of Relational power.
Notice that the existing keys are already unique and meaningful, short and efficient; no additional surrogate keys (and their additional index) are required.
ReviewerId, OwnerId and BorrowerId are all MemberIds, used as Foreign Keys, showing the explicit Role in which they are used.
Note that your problem space is not as simple as you think; it is used as a case study and shipped with tutorials for SQL (e.g. MS SQL, Sybase).
Social Library Data Model
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation useful.
I have provided the structure required to support borrowing, to again illustrate how easy it is to implement Relations on Normalised data, and to show the correct tables upon which borrowing depends (it is not between any book and any person; only an owned book can be borrowed).
These issues are very important because they define the Referential Integrity of the database.
It is also important to implement that in the database itself, which is the Standard location (rather than in app code all over the place). Declarative Referential Integrity is part of IEC/ISO/ANSI Standard SQL. And the question has a database design tag.
Referential Integrity cannot be defined or enforced in some databases that do not fully implement the SQL Standard (sometimes it can be defined but it is not enforced, which is confusing). Nevertheless, you can design and implement whatever parts of a database your particular database supports.
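The linked data model is the authoritative picture; purely as a hedged sketch of the ideas above (these table and column names are my assumptions, not the model's), normalised tables with declarative Referential Integrity might look something like this:

CREATE TABLE member (
    member_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE book (                -- the book as it exists in the world
    book_id INT PRIMARY KEY,
    title   VARCHAR(255) NOT NULL
);

CREATE TABLE book_owned (          -- a member's copy of a book
    book_id  INT NOT NULL,
    owner_id INT NOT NULL,         -- Role name for a member_id Foreign Key
    PRIMARY KEY (book_id, owner_id),
    FOREIGN KEY (book_id)  REFERENCES book (book_id),
    FOREIGN KEY (owner_id) REFERENCES member (member_id)
);

CREATE TABLE book_borrowed (       -- only an owned copy can be borrowed
    book_id       INT NOT NULL,
    owner_id      INT NOT NULL,
    borrower_id   INT NOT NULL,    -- another member_id, in the Borrower Role
    date_borrowed DATE NOT NULL,
    PRIMARY KEY (book_id, owner_id, borrower_id, date_borrowed),
    FOREIGN KEY (book_id, owner_id) REFERENCES book_owned (book_id, owner_id),
    FOREIGN KEY (borrower_id)       REFERENCES member (member_id)
);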

Related

Modelling limited availability in Doctrine

This question is about modelling limited availability in Doctrine 2. I'm sure this has already been discussed here, as it seems quite basic, but I could not find any best practices. Maybe limit/restrict/max/... are bad search terms, as they all mean something else in the db world :-).
Simplified example
Assume a typical online shop application that allows multiple users to buy items of some kind (at the same time). Some of these items may have limited availability (first come, first served). So two users may be in a concurrent situation when trying to check out / confirm the order. The faster one must win the race; the other order should not even be processed (inserted into the database).
Entities/tables may look like this:
items
+----+-----+---------------+---------+
| id | ... | max_available | version |
+----+-----+---------------+---------+
| 7  |     | 4             | 2       |
| 8  |     | 1             | 0       |
+----+-----+---------------+---------+
orders
+----+---------+----------+
| id | item_id | quantity |
+----+---------+----------+
| 1  | 7       | 2        |
| 2  | 7       | 1        |
+----+---------+----------+
In this case: another order for item 8 with a quantity of 1 would be valid. Another order for item 7 with a quantity of 2 must be prevented, as this would be one more than is available.
Best practice?
The application uses the Doctrine 2 ORM, and the db will be MySQL. The system may be coupled to the db type, but if there is a reasonable db-agnostic way, that's even better, of course.
What's the best way to model this?
Transactions and locking at the db level (the db needs to support this)? Locking at the ORM level (integer version field)? Or should there (additionally) be triggers installed that ensure data integrity at the database level?
Sidenote: Should constraints be optional by design or can they be part of the business logic? In other words: Is it bad practice to test against constraints and let the test fail under normal conditions - e.g. by having a (concurrency safe) trigger on updates/inserts, that cancels the request if an item isn't available anymore? (This would only work for certain db types and InnoDB as the engine in the case of MySQL...)
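As a hedged illustration of the first option (transactions and row locking at the db level, assuming InnoDB; item 7 and a quantity of 2 are hard-coded, and orders.id is assumed to be AUTO_INCREMENT):

START TRANSACTION;

-- Lock the item row; any concurrent checkout of item 7 blocks here until this
-- transaction commits or rolls back, so the availability check cannot race.
SELECT max_available FROM items WHERE id = 7 FOR UPDATE;

-- Still inside the transaction: how much of item 7 is already ordered?
SELECT COALESCE(SUM(quantity), 0) AS already_ordered
FROM orders WHERE item_id = 7;

-- The application compares already_ordered + 2 with max_available and only then
-- issues the INSERT; otherwise it rolls back and rejects the order.
INSERT INTO orders (item_id, quantity) VALUES (7, 2);

COMMIT;

Doctrine's optimistic locking via the version field is the ORM-level alternative mentioned in the question; which one fits depends on how db-agnostic the system needs to stay.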

Isn't using unnormalized design better when there are multiple JOINS?

Here is my table structure:
// posts
+----+--------+----------------+-------------+
| id | title  | body           | keywords    |
+----+--------+----------------+-------------+
| 1  | title1 | Something here | php,oop     |
| 2  | title2 | Something else | html,css,js |
+----+--------+----------------+-------------+
// tags
+----+------+
| id | name |
+----+------+
| 1  | php  |
| 2  | oop  |
| 3  | html |
| 4  | css  |
| 5  | js   |
+----+------+
// pivot
+---------+--------+
| post_id | tag_id |
+---------+--------+
| 1       | 1      |
| 1       | 2      |
| 2       | 3      |
| 2       | 4      |
| 2       | 5      |
+---------+--------+
As you see, I store keywords in two ways: as a string in a column named keywords, and relationally in separate tables.
Now I need to select all posts that have specific keywords (for example php and html tags). I can do that in two ways:
1: Using unnormalized design:
SELECT * FROM posts WHERE keywords REGEXP 'php|html';
2: Using normalized design:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name IN ('html', 'php')
GROUP BY posts.id
See? The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
What do you think? I mean what's your recommendation and why?
The second approach uses two JOINs. I guess it will be slower than using REGEXP on a huge dataset.
Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.
Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.
Further, maintaining the consistency of the data in the two places is much more difficult with the first approach. You would probably need to implement triggers to handle updates and inserts on all the tables. That slows things down.
There are some cases where it is worth the effort to do this -- think about summary counts or totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because parsing the string in SQL is not likely to be a really big benefit relative to the other costs.
With small tables, you can use either approach at your discretion.
If you expect the table to grow, you really need the second choice. The reason is that REGEXP can never use an index in MySQL, and indexes are the key to fast queries.
A JOIN will use an index if one is declared on the join column.
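As a hedged sketch for the tables above (assuming posts.id and tags.id are already primary keys; the index names are made up):

ALTER TABLE tags  ADD UNIQUE KEY uq_tags_name (name);          -- lets WHERE tags.name IN (...) use an index
ALTER TABLE pivot ADD PRIMARY KEY (post_id, tag_id);           -- one row per post/tag pair
ALTER TABLE pivot ADD KEY ix_pivot_tag_post (tag_id, post_id); -- drives the tag -> post side of the join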
Both look fine when we talk about data at a small scale. But it is fundamental for an OLTP system to have normalized tables: when you expect your table to scale and want the data to be non-redundant and consistent, normalization is the answer. Of course there are costs involved with joins, but they are trivial compared to these issues.
Let's talk about your scenario:
Pros:
All data is available by querying one table.
Cons:
A function wrapped around a column forces the query optimizer to scan the whole table regardless of any index on the column. This is very important from a data-scaling point of view.
Keywords in your case are repeated multiple times, leading to data redundancy.
Keywords appearing multiple times also lead to data inconsistencies: if you want to remove or update a keyword, the column has to be searched and the value replaced in every row, and any keyword left behind causes data integrity issues.
There are many more. Read up on data normalization in RDBMS.

In SQL, connecting a transaction table with different kinds of transactions

I am making a MySQL database for a restaurant.
I have a table called tbl_contents which stores all the contents used in the preparation of different menu items.
Now I have to maintain a table for all the expenditures. These expenditures can be "purchasing contents" or some regular expenditure like electricity bill or rent of the restaurant.
How do I store two kinds of expenditures in the same table?
I have the tables tbl_fixed_expenditures and tbl_contents.
If I buy something for the kitchen, it is supposed to be stored in tbl_contents, and if I have paid the electricity bill, it is saved in tbl_fixed_expenditures.
You are essentially trying to represent inheritance in a relational database.
You have two "classes" which are similar in some ways and different in others. My suggestion is to create a table that acts as a parent to both the fixed and the variable expenditures.
Here's what I would do:
+------------------+
| tbl_expenditures |
+------------------+
| id |
+------------------+
+------------------------+
| tbl_fixed_expenditures |
+------------------------+
| id |
| expenditureId |
| ... |
+------------------------+
+---------------------------+
| tbl_variable_expenditures |
+---------------------------+
| id |
| expenditureId |
| ... |
+---------------------------+
...where tbl_fixed_expenditures.expenditureId and tbl_variable_expenditures.expenditureId both have a reference to tbl_expenditures.id.
This way, when you need to refer to them simply as "expenditures" (for example, in your transaction table), you can reference tbl_expenditures, and when you need information that is unique to either fixed or variable expenditures, you can refer to the "child" tables.
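A minimal MySQL sketch of that layout (the "..." columns are placeholders; the types are assumptions):

CREATE TABLE tbl_expenditures (
    id INT AUTO_INCREMENT PRIMARY KEY
    -- columns common to all expenditures (amount, date, ...) could live here
);

CREATE TABLE tbl_fixed_expenditures (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    expenditureId INT NOT NULL,
    -- ... columns specific to fixed expenditures (e.g. bill type) ...
    FOREIGN KEY (expenditureId) REFERENCES tbl_expenditures (id)
);

CREATE TABLE tbl_variable_expenditures (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    expenditureId INT NOT NULL,
    -- ... columns specific to variable expenditures (e.g. a link to tbl_contents) ...
    FOREIGN KEY (expenditureId) REFERENCES tbl_expenditures (id)
);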
This is a very common problem with relational databases, and there are several ways of handling it, each of which have their pros and cons. IBM has a really good article outlining these options, and I highly recommend it for further reading:
http://www.ibm.com/developerworks/library/ws-mapping-to-rdb/
Well, it's kind of hard to give a proper answer but I can give you some vague conjectures based on what I've understood so far.
If both kinds of expenditures have different attributes but share a few details in common, you should normalize the tables. Use expenditures_tranx as the intermediary table (in OO terms, the top-level class), and the remaining tables, tbl_fixed_expenditures and tbl_contents, can be the "specialized" tables (again, in OO terms, the ones that "inherit" the attributes from the parent table) that store more detailed information about the expenditures. Here's a simple entity-relationship draft to illustrate.
 ____________          ___________________         ______________________
|tbl_contents|-1----*-| expenditures_tranx|-*---1-|tbl_fixed_expenditures|
|exp_id:fk___|        |___________________|       |exp_id:fk_____________|
Here's an interesting article that explains these concepts:
http://apps.topcoder.com/wiki/display/training/Entity+Relationship+Modeling
Let me know what you think.

Whether to merge avatar and profile tables?

I have two tables:
Avatars:
Id | UserId | Name          | Size
-----------------------------------------------
1  | 2      | 124.png       | Large
2  | 2      | 124_thumb.png | Thumb
Profiles:
Id | UserId | Location   | Website
-----------------------------------------------
1  | 2      | Dallas, Tx | www.example.com
These tables could be merged into something like:
User Meta:
Id | UserId | MetaKey     | MetaValue
-----------------------------------------------
1  | 2      | location    | Dallas, Tx
2  | 2      | website     | www.example.com
3  | 2      | avatar_lrg  | 124.png
4  | 2      | avatar_thmb | 124_thumb.png
This to me could be a cleaner, more flexible setup (at least at first glance). For instance, if I need to allow a "user status message", I can do so without touching the database.
However, the user's avatars will be pulled far more than their profile information.
So I guess my real questions are:
What kind of performance hit would this produce?
Is merging these tables just a really bad idea?
This is almost always a bad idea. What you are doing is a form of the Entity Attribute Value model. This model is sometimes necessary when a system needs a flexible attribute system to allow the addition of attributes (and values) in production.
This type of model is essentially built on metadata in lieu of real relational data. This can lead to referential integrity issues, orphan data, and poor performance (depending on the amount of data in question).
As a general matter, if your attributes are known up front, you want to define them as real data (i.e. actual columns with actual types) as opposed to string-based metadata.
In this case, it looks like users may have one large avatar and one small avatar, so why not make those columns on the user table?
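For instance, a minimal sketch of that suggestion (a users table and these column names are assumptions; the question doesn't show the users table itself):

ALTER TABLE users
    ADD COLUMN avatar_large VARCHAR(255) NULL,   -- e.g. '124.png'
    ADD COLUMN avatar_thumb VARCHAR(255) NULL;   -- e.g. '124_thumb.png'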
We have a similar type of table at work that probably started with good intentions, but is now quite the headache to deal with. This is because it now has 100s of different "MetaKeys", and there is no good documentation about what is allowed and what each does. You basically have to look at how each is used in the code and figure it out from there. Thus, figure out how you will document this for future developers before you go down that route.
Also, to retrieve all the information about each user it is no longer a 1-row query, but an n-row query (where n is the number of fields on the user). Also, once you have that data, you have to post-process each of those based on your meta-key to get the details about your user (which usually turns out to be more of a development effort because you have to do a bunch of String comparisons). Next, many databases only allow a certain number of rows to be returned from a query, and thus the number of users you can retrieve at once is divided by n. Last, ordering users based on information stored this way will be much more complicated and expensive.
In general, I would say that you should make any fields that have specialized functionality or require ordering to be columns in your table. Since they will require a development effort anyway, you might as well add them as an extra column when you implement them. I would say your avatar pics fall into this category, because you'll probably have one of each, and will always want to display the large one in certain places and the small one in others. However, if you wanted to allow users to make their own fields, this would be a good way to do this, though I would make it another table that can be joined to from the user table. Below are the tables I'd suggest. I assume that "Status" and "Favorite Color" are custom fields entered by user 2:
User:
| Id | Name      | Location   | Website         | avatarLarge | avatarSmall   |
-------------------------------------------------------------------------------
| 2  | iPityDaFu | Dallas, Tx | www.example.com | 124.png     | 124_thumb.png |
UserMeta:
Id | UserId | MetaKey        | MetaValue
-----------------------------------------------
1  | 2      | Status         | Hungry
2  | 2      | Favorite Color | Blue
I'd stick with the original layout. Here are the downsides of replacing your existing table structure with a big table of key-value pairs that jump out at me:
Inefficient storage - since the data stored in the metavalue column is mixed, the column must be declared with the worst-case data type, even if all you would need to hold is a boolean for some keys.
Inefficient searching - should you ever need to do a lookup from the value in the future, the mishmash of data will make indexing a nightmare.
Inefficient reading - reading a single user record now means doing an index scan for multiple rows, instead of pulling a single row.
Inefficient writing - writing out a single user record is now a multi-row process.
Contention - having mixed your user data and avatar data together, you've forced threads that only care about one or the other to operate on the same table, increasing your risk of running into locking problems.
Lack of enforcement - your data constraints have now moved into the business layer. The database can no longer ensure that all users have all the attributes they should, or that those attributes are of the right type, etc.

Got a table of people, who I want to link to each other, many-to-many, with the links being bidirectional

Imagine you live in very simplified example land - and imagine that you've got a table of people in your MySQL database:
create table person (
    person_id int,
    name text
)
select * from person;
+-----------+-------+
| person_id | name  |
+-----------+-------+
| 1         | Alice |
| 2         | Bob   |
| 3         | Carol |
+-----------+-------+
and these people need to collaborate/work together, so you've got a link table which links one person record to another:
create table person__person (
    person__person_id int,
    person_id int,
    other_person_id int
)
This setup means that links between people are uni-directional - i.e. Alice can link to Bob, without Bob linking to Alice and, even worse, Alice can link to Bob and Bob can link to Alice at the same time, in two separate link records. As these links represent working relationships, in the real world they're all two-way mutual relationships. The following are all possible in this setup:
select * from person__person;
+-------------------+-----------+-----------------+
| person__person_id | person_id | other_person_id |
+-------------------+-----------+-----------------+
| 1                 | 1         | 2               |
| 2                 | 2         | 1               |
| 3                 | 2         | 2               |
| 4                 | 3         | 1               |
+-------------------+-----------+-----------------+
For example, with person__person_id = 4 above, when you view Carol's (person_id = 3) profile, you should see a relationship with Alice (person_id = 1) and when you view Alice's profile, you should see a relationship with Carol, even though the link goes the other way.
I realize that I can do union and distinct queries and whatnot to present the relationships as mutual in the UI, but is there a better way? I've got a feeling that there is a better way, one where this issue would neatly melt away by setting up the database properly, but I can't see it. Anyone got a better idea?
I'm not sure if there is a better way to configure your tables. I think the way you have them is proper and would be the way I would implement it.
Since your relationship table can indicate unidirectional relationships, I would suggest treating them as such. In other words, for every relationship, I would add two rows. If Alice is collaborating with Bob, the table ought to be as follows:
select * from person__person;
+-------------------+-----------+-----------------+
| person__person_id | person_id | other_person_id |
+-------------------+-----------+-----------------+
| 1                 | 1         | 2               |
| 2                 | 2         | 1               |
+-------------------+-----------+-----------------+
The reason is that in a lot of ActiveRecord-like (Rails) systems, the many-to-many table object would not be smart enough to query both person_id and other_person_id. By keeping two rows, ActiveRecord-like objects will work correctly.
What you should do then is enforce the integrity of your data at the code level. Every time a relationship is established between two users, two records should be inserted. When a relationship is destroyed, both records should be deleted. Users should not be allowed to establish relationships with themselves.
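For example, a minimal sketch of what that code-level enforcement might issue (IDs hard-coded for illustration; assumes person__person_id is auto-generated or nullable):

-- Establish a relationship between Alice (1) and Bob (2): both directions, atomically.
START TRANSACTION;
INSERT INTO person__person (person_id, other_person_id) VALUES (1, 2);
INSERT INTO person__person (person_id, other_person_id) VALUES (2, 1);
COMMIT;

-- Destroy the relationship: delete both directions.
DELETE FROM person__person
WHERE (person_id = 1 AND other_person_id = 2)
   OR (person_id = 2 AND other_person_id = 1);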
There is no way I can see to do this using simple relational concepts. You will have to add "business" code to enforce your person-person relationships.
One way might be to enforce a second relationship record using an insert trigger,
then declare one of the records "primary" (e.g. the one where the person_id is smaller than the other_person_id),
then build views and glue code for your applications (select, update, delete) that access the data with this knowledge.
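As a hedged sketch of the "views" part (the view name is made up; it simply exposes every stored link in both directions, so reads don't care which way a row was written):

CREATE VIEW person__person_both AS
SELECT person_id, other_person_id FROM person__person
UNION
SELECT other_person_id, person_id FROM person__person;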
There is no better idea. Relational databases cannot enforce what you ask so you have to write a special query to retrieve data and a trigger to enforce the constraint.
To get the related persons for #person I would go for:
SELECT CASE person_id WHEN #person
            THEN other_person_id
            ELSE person_id
       END AS related_person_id
FROM person__person
WHERE (person_id = #person
   OR other_person_id = #person)
You should find this post useful:
http://discuss.joelonsoftware.com/default.asp?design.4.361252.31
As #user posted, you are generally better off creating two records per bidirectional relationship (Alice to Bob and Bob to Alice in your example). It makes querying much easier and it accurately reflects the relationships. If you do have true unidirectional relationships, it's the only way to fly.