I want to know the best practice for relating one row in one table, to multiple rows in another table.
Let's say I have the following two tables:
table_users
id | username
------------------------------------------------------------------------
1 | user1
2 | user2
3 | user3
table_texts
id | text
------------------------------------------------------------------------
1 | This is a secret text, that only user2 and user3 should see.
Now my only solution is to create a third table:
table_user_text_relation
id | text_id | user_id
------------------------------------------------------------------------
1 | 1 | 2
2 | 1 | 3
And then select like this:
SELECT
table_texts.text
FROM
table_users, table_texts, table_user_text_relation
WHERE
table_users.id = table_user_text_relation.user_id
AND
table_texts.id = table_user_text_relation.text_id
And that is fine... However, if I have 6000 users each with access to 500 texts, table_user_text_relation would then have to have 3.000.000 rows to establish many to many relationships?
Is there a better / smarter way to do this?
This is how I do it. I have been searching for a long time, and having a match table is the best way. Because the match table only uses ints, it doesn't take up too much room. I would say this is the best practice.
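A minimal sketch of the match table approach, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The composite primary key (instead of a separate id column) and the helper name texts_for_user are my own assumptions, not something from the question; the table and column names are the ones from the question.

```python
import sqlite3

# In-memory SQLite database, used here only as a runnable stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE table_texts (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    -- Assumption: a composite primary key, so each (user, text) pair exists once
    -- and lookups by user_id can use the key's index.
    CREATE TABLE table_user_text_relation (
        text_id INTEGER NOT NULL REFERENCES table_texts(id),
        user_id INTEGER NOT NULL REFERENCES table_users(id),
        PRIMARY KEY (user_id, text_id)
    );
    INSERT INTO table_users VALUES (1, 'user1'), (2, 'user2'), (3, 'user3');
    INSERT INTO table_texts VALUES
        (1, 'This is a secret text, that only user2 and user3 should see.');
    INSERT INTO table_user_text_relation (text_id, user_id) VALUES (1, 2), (1, 3);
""")


def texts_for_user(user_id):
    """Return the texts a given user may see (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT t.text
        FROM table_user_text_relation AS r
        JOIN table_texts AS t ON t.id = r.text_id
        WHERE r.user_id = ?
        ORDER BY t.id
        """,
        (user_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the sample rows, texts_for_user(2) returns the secret text and texts_for_user(1) returns an empty list. Each relation row is just two integers, which is why a few million rows is usually not a concern.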
Related
I'm building a simple friendship table with 3 columns: id, user1, user2
Once a user becomes friends with another user, their ids will be added to the table like this:
+----+-------+-------+
| id | user1 | user2 |
+----+-------+-------+
| 1 | 15 | 9 |
+----+-------+-------+
| 2 | 9 | 32 |
+----+-------+-------+
The above table is ok but in some cases, a user might want to become friends with a user they are already friends with, resulting in the following table:
+----+-------+-------+
| id | user1 | user2 |
+----+-------+-------+
| 1 | 15 | 9 |
+----+-------+-------+
| 2 | 9 | 32 |
+----+-------+-------+
| 3 | 9 | 15 |
+----+-------+-------+
In this table, index 1 and 3 are in conflict with each other (2 & 3 are NOT) and I would therefore like an error returned upon insertion (duplicate entry). Is there a way to do this?
When inserting into this table, before you issue the query, you should always make sure that user1 has the smaller of the two user IDs. Then the situation you described will never happen.
You can implement this either in your application code, or as a stored procedure that you execute to insert a new relation between two users.
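Here is a rough sketch of the application code variant, again with Python's sqlite3 standing in for MySQL. The table name friendships, the UNIQUE and CHECK constraints, and the helper add_friendship are assumptions I added for illustration; the idea of always putting the smaller ID in user1 is the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friendships (
        id INTEGER PRIMARY KEY,
        user1 INTEGER NOT NULL,
        user2 INTEGER NOT NULL,
        CHECK (user1 < user2),   -- assumption: reject unordered or self pairs
        UNIQUE (user1, user2)    -- assumption: makes the duplicate an error
    )
""")


def add_friendship(a, b):
    """Insert a friendship with the smaller ID first; False if rejected."""
    low, high = min(a, b), max(a, b)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO friendships (user1, user2) VALUES (?, ?)",
                (low, high),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, add_friendship(15, 9) and add_friendship(9, 32) succeed, while a later add_friendship(9, 15) is rejected because it normalizes to the same (9, 15) pair.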
Let me offer another perspective. You might want to keep the friends table as a reciprocal relationship. So, both directions would be stored in the table. The correct table would look like:
+----+-------+-------+
| id | user1 | user2 |
+----+-------+-------+
| 1 | 15 | 9 |
+----+-------+-------+
| 2 | 9 | 15 |
+----+-------+-------+
| 3 | 9 | 32 |
+----+-------+-------+
| 4 | 32 | 9 |
+----+-------+-------+
Why would you want to do such an absurd thing, doubling the size of the data? Typical queries on such data ask who is a friend of a friend, or list all the friends of a given user. Such queries require traversing the data as a graph structure, and you need both links. Not only do such queries become much more complicated with a single row per friendship, but because subqueries (derived tables) are often involved, the query loses the ability to use indexes.
With this structure, you need to be careful when inserting to insert both directions of the relationship. Then a simple unique constraint on the two columns ensures that duplicates are not inserted.
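To make the reciprocal layout concrete, here is a small sketch using Python's sqlite3 in place of MySQL. The table name friends and the helper functions are hypothetical names of mine; the two-rows-per-friendship idea and the unique constraint on the two columns come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friends (
        id INTEGER PRIMARY KEY,
        user1 INTEGER NOT NULL,
        user2 INTEGER NOT NULL,
        UNIQUE (user1, user2)
    )
""")


def add_friendship(a, b):
    """Insert both directions in one transaction; False on a duplicate."""
    try:
        with conn:  # either both rows are committed or neither is
            conn.executemany(
                "INSERT INTO friends (user1, user2) VALUES (?, ?)",
                [(a, b), (b, a)],
            )
        return True
    except sqlite3.IntegrityError:
        return False


def friends_of(user):
    """All friends of a user: a plain lookup on user1."""
    rows = conn.execute(
        "SELECT user2 FROM friends WHERE user1 = ? ORDER BY user2", (user,)
    )
    return [row[0] for row in rows]


def friends_of_friends(user):
    """Friends of friends, excluding the user and their direct friends."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.user2
        FROM friends AS f1
        JOIN friends AS f2 ON f2.user1 = f1.user2
        WHERE f1.user1 = ?
          AND f2.user2 <> ?
          AND f2.user2 NOT IN (SELECT user2 FROM friends WHERE user1 = ?)
        ORDER BY f2.user2
        """,
        (user, user, user),
    )
    return [row[0] for row in rows]


add_friendship(15, 9)
add_friendship(9, 32)
```

Because every friendship is stored in both directions, both queries only ever filter on user1, so a single index on (user1, user2) can serve them.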
You can create a trigger to automatically fix this, similar to Dmytro's answer:
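DELIMITER //
CREATE TRIGGER trgr_uid_check BEFORE INSERT ON Relationships
FOR EACH ROW
BEGIN
  -- Always store the smaller ID in user1, so (15, 9) is saved as (9, 15).
  IF NEW.user1 > NEW.user2 THEN
    SET @user1 = NEW.user1;      -- session variable used as a temporary for the swap
    SET NEW.user1 = NEW.user2;
    SET NEW.user2 = @user1;
  END IF;
END//
DELIMITER ;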
You could do a simple query to check if there is already a friendship:
SELECT id
FROM your_table
WHERE (user1 = numToInsert1 AND user2 = numToInsert2)
OR (user1 = numToInsert2 AND user2 = numToInsert1)
If this statement returns anything it means that there is already a friendship between those two. If this statement does not return anything insert your new friendship.
I'd like to ask what the best approach would be for structuring my MySQL database in the following case.
I've got a table with items, which I don't need to describe, as the only important field here is the ID.
Now, I'd like to be able to assign some attributes to each item - by its ID, of course. But I don't know exactly how to do it, as I'd like to keep it dynamic (so, I do not have to modify the table structure if I want to add a new attribute type).
What I think
I think (and this is, in fact, the structure that I have right now) that I can make a table items_attributes with the following structure:
+----+---------+----------------+-----------------+
| id | item_id | attribute_name | attribute_value |
+----+---------+----------------+-----------------+
| 1 | 1 | place | Barcelona |
| 2 | 2 | author_name | Matt |
| 3 | 1 | author_name | Kate |
| 4 | 1 | pages | 200 |
| 5 | 1 | author_name | John |
+----+---------+----------------+-----------------+
I put data as an example for you to see that those attributes can be repeated (it's not a relation 1 to 1).
The problem with this approach
I need to run some queries, some of them for statistical purposes, and if I have a lot of attributes for a lot of items, this can be a bit slow.
Furthermore, maybe because I'm not an expert on MySQL, every time I want to search for "those items that have 'place' = 'Barcelona' AND 'author_name' = 'John'", I end up having to make multiple JOINs, one for every condition.
Repeating the example before, my query would end up like:
SELECT *
FROM items its
JOIN items_attributes attr
ON its.id = attr.item_id
AND attr.attribute_name = 'place'
AND attr.attribute_value = 'Barcelona'
AND attr.attribute_name = 'author_name'
AND attr.attribute_value = 'John';
As you can see, this will return nothing, as an attribute_name cannot have two values at once in the same row, and an OR condition would not be what I'm searching for as the items MUST have both attributes values as stated.
So the only possibility is to JOIN the same table again for every condition in the search, which I think is very slow when there are a lot of terms to search for.
What I'd like
As I said, I'd like to keep the attribute types dynamic, so that adding a new value in 'attribute_name' is enough, without having to add a new column to a table. Also, as this is a 1-N relationship, the attributes cannot be put in the 'items' table as new columns.
If, in your opinion, this structure is the only one that can achieve what I want, it would be great if you could suggest some ideas so that the search queries are not a ton of JOINs.
I don't know if this is hard to do; I've been racking my brain and haven't come up with a solution. I hope you guys can help me with that!
In any case, thank you for your time and attention!
Kind regards.
You're thinking in the right direction, the direction of normalization. The normal form you would like to have in your database is the fifth normal form (or even the sixth). Stackoverflow on this matter.
Table Attribute:
+----+----------------+
| id | attribute_name |
+----+----------------+
| 1 | place |
| 2 | author_name |
| 3 | pages |
+----+----------------+
Table ItemAttribute
+--------+----------------+
| item_id| attribute_id |
+--------+----------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
+--------+----------------+
So for each property of an object (item in this case) you create a new table and name it accordingly. It requires lots of joins, but your database will be highly flexible and organized. Good luck!
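For the search from the question (items that have place = Barcelona AND author_name = John), one common alternative to one JOIN per condition is to match any of the wanted pairs and then keep only the items that matched all of them, using GROUP BY and HAVING. The sketch below runs against the original items_attributes layout with Python's sqlite3 as a stand-in for MySQL. The UNIQUE constraint and the helper name items_matching are assumptions of mine; the counting trick relies on that constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items_attributes (
        id INTEGER PRIMARY KEY,
        item_id INTEGER NOT NULL,
        attribute_name TEXT NOT NULL,
        attribute_value TEXT NOT NULL,
        -- Assumption: the same (item, name, value) triple is stored only once,
        -- so COUNT(*) below counts distinct matched conditions.
        UNIQUE (item_id, attribute_name, attribute_value)
    );
    INSERT INTO items_attributes (item_id, attribute_name, attribute_value) VALUES
        (1, 'place', 'Barcelona'),
        (2, 'author_name', 'Matt'),
        (1, 'author_name', 'Kate'),
        (1, 'pages', '200'),
        (1, 'author_name', 'John');
""")


def items_matching(conditions):
    """Return ids of items that satisfy every (name, value) pair."""
    pairs = sorted(set(conditions))  # drop repeated conditions
    if not pairs:
        return []
    # One "(name = ? AND value = ?)" group per condition, OR-ed together.
    where = " OR ".join(
        ["(attribute_name = ? AND attribute_value = ?)"] * len(pairs)
    )
    params = [part for pair in pairs for part in pair]
    sql = f"""
        SELECT item_id
        FROM items_attributes
        WHERE {where}
        GROUP BY item_id
        HAVING COUNT(*) = ?
        ORDER BY item_id
    """
    rows = conn.execute(sql, params + [len(pairs)])
    return [row[0] for row in rows]
```

The query touches the attributes table once regardless of how many conditions there are; only the WHERE list and the number compared in HAVING grow.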
In my opinion it should be something like this. I know there are a lot of tables, but it actually normalizes your DB.
Maybe that is because I can't understand where you get your att_value column from, and what this column should contain.
Assume a very large database. A table with 900 million records.
Method A:
Table: Posts
+----------+---------------+------------------+----------------+
| id (int) | item_id (int) | post_type (ENUM) | Content (TEXT) |
+----------+---------------+------------------+----------------+
| 1 | 1 | user | some text ... |
+----------+---------------+------------------+----------------+
| 2 | 1 | page | some text ... |
+----------+---------------+------------------+----------------+
| 3 | 1 | group | some text ... |
// row 1 : User with ID 1 has a post with ID #1
// row 2 : Page with ID 1 has a post with ID #2
// row 3 : Group with ID 1 has a post with ID #3
The goal is displaying 20 records from all 3 post_types in a page.
SELECT * FROM posts LIMIT 20
But I am worried about number of records for this method
Method B:
Separate 900 million records to 3 tables with 300 millions for each one.
Table: User Posts
+----------+---------------+----------------+
| id (int) | user_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1 | 1 | some text ... |
+----------+---------------+----------------+
| 2 | 2 | some text ... |
+----------+---------------+----------------+
| 3 | 3 | some text ... |
Table: Page Posts
+----------+---------------+----------------+
| id (int) | page_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1 | 1 | some text ... |
+----------+---------------+----------------+
| 2 | 2 | some text ... |
+----------+---------------+----------------+
| 3 | 3 | some text ... |
Table: Group Posts
+----------+----------------+----------------+
| id (int) | group_id (int) | Content (TEXT) |
+----------+----------------+----------------+
| 1 | 1 | some text ... |
+----------+----------------+----------------+
| 2 | 2 | some text ... |
+----------+----------------+----------------+
| 3 | 3 | some text ... |
now to get a list of 20 posts to display
SELECT * FROM User_Posts LIMIT 10
SELECT * FROM Page_Posts LIMIT 10
SELECT * FROM group_posts LIMIT 10
// and make an array or object of result. and display in output.
With this method, I would have to sort them in an array in PHP, and then send them to the page.
Which method is preferred?
Will separating a 900 million record table into three tables affect the speed of reading and writing in MySQL?
This is actually a discussion about single table inheritance vs. table-per-class inheritance, and it leaves out joined inheritance. The first corresponds to Method A and the second to your Method B; a Method C would keep the IDs of all posts in one table and move the attributes specific to user, page, or group posts into separate tables.
While a big table always has negative impacts related to full table scans, splitting tables has its own, too. It depends on how often your application needs to access the whole list of posts vs. only retrieving certain post types.
Another consideration is data partitioning, which is available in MySQL or Oracle Database, for example. It is a way of organizing the data within a table that opens up opportunities for information lifecycle management (which data is accessed when and how often, and whether part of it can be moved and compressed, reducing database size and speeding up access to the remaining data in the table). It is basically split into three major techniques:
Range based partitioning, list based partitioning and hash based partitioning.
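To illustrate how the three techniques decide where a row goes, here is a toy sketch in plain Python. It is not how a database implements partitioning internally; the function names, the boundary values, and the partition names are all made up for the example. The hash variant uses a simple modulo on an integer key.

```python
def range_partition(post_id, upper_bounds):
    """Range: index of the first partition whose exclusive upper bound
    is greater than the key (like 'VALUES LESS THAN' boundaries)."""
    for index, bound in enumerate(upper_bounds):
        if post_id < bound:
            return index
    raise ValueError(f"no partition accepts id {post_id}")


def list_partition(post_type, value_lists):
    """List: the partition whose explicit value list contains the key."""
    for name, values in value_lists.items():
        if post_type in values:
            return name
    raise ValueError(f"no partition accepts type {post_type!r}")


def hash_partition(item_id, partition_count):
    """Hash: spread integer keys evenly using the remainder."""
    return item_id % partition_count


# Hypothetical layout for the posts table from the question.
BOUNDS = [300_000_000, 600_000_000, 900_000_001]
TYPE_LISTS = {"p_user": {"user"}, "p_page": {"page"}, "p_group": {"group"}}
```

For example, range_partition(450_000_000, BOUNDS) picks partition 1, list_partition("page", TYPE_LISTS) picks "p_page", and hash_partition(10, 4) picks partition 2. A query that filters on the partitioning key only has to look at the matching partition, which is the main appeal for a table of this size.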
Other, less commonly supported features for reducing table sizes are inserts with a timestamp that automatically invalidates the inserted data after a certain time period has expired.
What is indeed a major application design decision, and can boost performance, is distinguishing between read and write accesses to the database at the application level.
Consider a MySQL backend: because write accesses are obviously more critical to database performance than read accesses, you could set up one MySQL instance for writing to the database and another one, a replica of it, for the read accesses. This is debatable, though, mainly when it comes to RDT (real time decisions), where absolute consistency of data at any given time is a must.
Using object pools as a layer between your application and the database is also a technique to improve application performance, though I don't know of existing solutions in the PHP world yet. Oracle Hot Cache is a pretty sophisticated example of it.
You could build your own on top of an in-memory database or using memcache, though.
I want to create a friends system (something like in facebook).
I want to save relationship data in MySql, but I do not know which way is better:
To save every single relationship as a single entry, such as:
id | people1 | people2
1 | john | maria
2 | john | fred
3 | maria | fred
(here I declare the relationships between all three of these people)
To save everyone's name and list their friends:
id | people | friends
1 | fred | mary, john
2 | mary | john, fred
3 | john | fred, mary
Or maybe there is better way?
No Dear,
You just need one single table to model the friend relationship. This is the structure I have used:
id (primary key) | my_id (integer, the logged-in user's ID) | friend_id (integer, the ID of the other user, who receives the friend request from the logged-in user)
For example, if we have two users in our users table, then we have two entries, one for each user, to relate them to each other:
id | name | age
1 | vipan | 12
2 | karan | 12
then the entries should be
id | my_id | friend_id
1 | 1 | 2
2 | 2 | 1
Please don't vote this down. I have used this table structure on my site, and the same structure is used in joomsocial, so I think it is the best table structure. Also, please don't use comma separated values in a table; they cause problems in joins and relationships in some cases.
Please see comment number 4 on the following post:
Separate comma separated values from mysql table
The first one is the best, no doubt, because the second one would not respect the first normal form.
You have to avoid multiple values in the same column because it will get really painful to edit.
Here's the link about database normalization. Most of the time, we respect the third normal form because it's a good compromise between normalization and performance.
Also, like Randy said, you have to use the IDs so that you can link them with a foreign key.
Just after some opinions on the best way to achieve the following outcome:
I would like to store in my MySQL database products which can be voted on by users (each vote is worth +1). I also want to be able to see how many times in total a user has voted.
To my simple mind, the following table structure would be ideal:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
This way I can do a COUNT of the user_product_vote table for each product or user.
For example, when I want to look up bananas and the number of votes to show on a web page I could perform the following query:
SELECT p.product AS product, COUNT( v.id ) as votes
FROM product p
LEFT JOIN user_product_vote v ON p.id = v.product_id
WHERE p.id =1
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
A more simple approach would be to have a 'votes' column in the product table that is incremented each time a vote is added.
table: product
+----+-------------+-------+
| id | product | votes |
+----+-------------+-------+
| 1 | bananas | 2 |
| 2 | apples | 5 |
| .. | .. | .. |
While this is more resource friendly - I lose data (eg. I can no longer prevent a person from voting twice as there is no record of their voting activity).
My questions are:
i) am I being overly worried about server resources and should just stick with the three table option? (ie. do I need to have more faith in the ability of the database to handle large queries)
ii) is there a more efficient way of achieving the outcome without losing information?
You can never be too worried about resources. When you first start building an application you should always keep resources, space, speed etc. in mind; if your site's traffic grows dramatically and you never built with resources in mind, you start getting into problems.
As for the vote system, personally I would keep the votes like so:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
Reasons:
Firstly user_product_vote does not contain text, blobs etc., it's purely integer so it takes up less resources anyways.
Secondly, you have more of a doorway to new entities within your application, such as total votes in the last 24 hours, highest rated product over the past 24 hours etc.
Take this example for instance:
table: user_product_vote
+----+------------+---------+-----------+------+
| id | product_id | user_id | vote_type | time |
+----+------------+---------+-----------+------+
| 1 | 1 | 2 | product |224.. |
| 2 | 2 | 2 | page |218.. |
| .. | .. | .. | .. | .. |
And a simple query:
SELECT COUNT(id) as total FROM user_product_vote WHERE vote_type = 'product' AND time BETWEEN(....) ORDER BY time DESC LIMIT 20
Another thing: if a user voted at 1AM and then tried to vote again at 2PM, you can easily check when they last voted and whether they should be allowed to vote again.
There are so many opportunities that you will be missing if you stick with your incremental example.
As for your COUNT(), no matter how much you optimize your queries, it would not really make a difference on a large scale.
With an extremely large user base, your resource usage will be looked at from a different perspective, such as load balancers and, mainly, server settings, Apache, caching etc.; there's only so much you can do with your queries.
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
Don't waste your time solving imaginary problems. MySQL is perfectly able to process thousands of records in fractions of a second; this is what databases are for. A clean and simple database and code structure is far more important than the mythical "optimization" that no one needs.
Why not mix and match both? Simply have the final counts in the product and users tables, so that you don't have to count every time and have the votes table , so that there is no double posting.
Edit:
To explain it a bit further, the product and user tables will each have a column called "votes". Every time the insert into user_product_vote is successful, increment the relevant user and product records. This avoids duplicate votes, and you won't have to run the complex count query every time either.
Edit:
Also, I am assuming that you have created a unique index on product_id and user_id; in this case any duplication attempt will automatically fail and you won't have to check the table before inserting. You will just need to make sure the insert query ran and you got a valid value for the "id" in the form of insert_id.
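A minimal sketch of this mix-and-match idea, with Python's sqlite3 standing in for MySQL. The helper name cast_vote and the use of a single transaction for the three statements are my own assumptions; the "votes" columns and the unique index on product_id and user_id are the ones described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        product TEXT NOT NULL,
        votes INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE user (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        votes INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE user_product_vote (
        id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(id),
        user_id INTEGER NOT NULL REFERENCES user(id),
        UNIQUE (product_id, user_id)  -- one vote per user per product
    );
    INSERT INTO product (id, product) VALUES (1, 'bananas'), (2, 'apples');
    INSERT INTO user (id, username) VALUES (1, 'matthew'), (2, 'mark');
""")


def cast_vote(user_id, product_id):
    """Record a vote and bump both counters; False if it is a duplicate."""
    try:
        with conn:  # all three statements commit together or not at all
            conn.execute(
                "INSERT INTO user_product_vote (product_id, user_id) VALUES (?, ?)",
                (product_id, user_id),
            )
            conn.execute(
                "UPDATE product SET votes = votes + 1 WHERE id = ?", (product_id,)
            )
            conn.execute(
                "UPDATE user SET votes = votes + 1 WHERE id = ?", (user_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a second vote; the counters are untouched.
        return False
```

Page views can then read product.votes directly, while user_product_vote stays the source of truth if the counters ever need to be rebuilt.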
You have to balance the desire for your site to perform quickly (in which the second schema would be best) and the ability to count votes for specific users and prevent double voting (for which I would choose the first schema). Because you are only using integer columns for the user_product_vote table, I don't see how performance could suffer too much. Many-to-many relationships are common, as you have implemented with user_product_vote. If you do want to count votes for specific users and prevent double voting, a user_product_vote is the only clean way I can think of implementing it, as any other could result in sparse records, duplicate records, and all kinds of bad things.
You don't want to update the product table directly with an aggregate every time someone votes - this will lock product rows which will then affect other queries which are using products.
Assuming that not all product queries need to include the votes column, you could keep a separate productvotes table which would retain the running totals, and keep your userproductvote table as a means to enforce your user voting per product business rules / and auditing.