join, foreign key and other relational DB planning - mysql

I am new to database planning and to programming in general.
I need to develop a desktop app for realtors.
It needs to have at least 2 tables:
property_table - id, license #, address, city, bedrooms, baths, laundry, etc, etc.
image_table - id, picture_name, path, size (image related DB)
(it will probably need an agent_table, but let's keep things simple).
property_table will have only one address per ID. A new entry with the same address has to generate a new ID (a person re-selling the same house).
But image_table may have 10 entries for the same property address.
I am using a PHP session to carry the address, city, and zip code between tables to avoid user mistakes (therefore image_table is actually id, picture_name, path, size, address, city, zip code, username).
QUESTION: should I use a foreign key, or just a join in my searches? There are many questions about this, like here, and good tutorials on joins, etc. It seems I have to use a join query. What about the foreign key?
WHY: I need to show the listings as if they come from different DBs. An address (table 1) has several pictures (table 2).
PLANNING AHEAD: in the long term, the same address will have more than one entry (same address, same zip code).
I am just confused by so much new information and trying to plan ahead.
Thank you so much for your time.

property_table - property_table_id (PK), license #, address, city, bedrooms, baths, laundry, etc, etc.
image_table - image_table_id (PK), property_table_id (FK), picture_name, path, size (image related DB)
SELECT *
FROM property_table PropTable
INNER JOIN image_table imgTable ON PropTable.property_table_id = imgTable.property_table_id
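To make this concrete, here is a minimal DDL sketch of the two tables with the foreign key declared; the column types (and the ENGINE=InnoDB choice, which MySQL needs for enforced foreign keys) are illustrative assumptions, not part of the original answer.
CREATE TABLE property_table (
    property_table_id INT AUTO_INCREMENT PRIMARY KEY,   -- one ID per listing, even for a re-sold address
    license_no VARCHAR(50),
    address VARCHAR(255),
    city VARCHAR(100),
    bedrooms INT,
    baths INT,
    laundry VARCHAR(50)
) ENGINE=InnoDB;

CREATE TABLE image_table (
    image_table_id INT AUTO_INCREMENT PRIMARY KEY,
    property_table_id INT NOT NULL,                      -- the listing this picture belongs to
    picture_name VARCHAR(255),
    path VARCHAR(255),
    size INT,
    FOREIGN KEY (property_table_id) REFERENCES property_table (property_table_id)
) ENGINE=InnoDB;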

In general I would say 'yes' to using the foreign key (FK). I am assuming that since you are using PHP, you are probably using one of the many popular free or open-source databases such as MySQL as your relational back end. Having the FK will allow you to set constraints on your data, to prevent mistakes that may cause your program to error out.
For example, in your scenario you have the property_table table, which will have several addresses, some of which may use the same images. In this case, you would want a column in your property_table, maybe property_table.image_id, that is a FK to your image_table table, referencing the column image_table.id.
If you set up your constraints properly on the FK, you will prevent yourself from accidentally entering the id of an image that does not exist in the image_table table. You can also use constraints to automatically manage the references for you when you change the data in the referenced (image_table) table. For example, if you delete the image from image_table, you could have all references to that image automatically set to NULL (an empty value) inside the property_table.
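As a hedged sketch of that last point, following this answer's wording (a property_table.image_id column referencing image_table.id, which differs from the column names sketched earlier), the constraint could look like this:
-- Hypothetical column and constraint, for illustration only
ALTER TABLE property_table
    ADD COLUMN image_id INT NULL,
    ADD CONSTRAINT fk_property_image
        FOREIGN KEY (image_id) REFERENCES image_table (id)
        ON DELETE SET NULL;   -- deleting the image clears the reference instead of erroring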

A foreign key is a constraint; a JOIN is a query technique. While it is true that JOINs are often (but not always) done "on top" of foreign keys, they are not the same thing.
So, if you need to ensure there are no "dangling pointers" (as you do between image_table and property_table), use a foreign key. Always enforce data integrity at the database level, even if you also enforce it in the UI.¹
If you need to get the related data from two (or more) tables using a single query, use a JOIN.
¹ Which will protect your data in case of bugs, especially subtle concurrency bugs that almost certainly exist in your code unless you employed locking very carefully. Furthermore, if you ever have to create another application that accesses the same database, it will benefit from integrity constraints that are already there. And if you ever modify the data ad hoc through the generic UI provided by your DBMS, it will be more difficult to "break" the data.
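For example, with the image_table.property_table_id foreign key from the schema above in place, MySQL rejects an insert that would create a dangling pointer (the values here are made up for illustration):
-- Fails with error 1452 (a foreign key constraint fails) if no property row has id 999
INSERT INTO image_table (property_table_id, picture_name, path, size)
VALUES (999, 'front.jpg', '/uploads/front.jpg', 204800);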

Related

what is the best practice - a new column or a new table?

I have a users table that contains many attributes like email, username, password, phone, etc.
I would like to save a new type of data (an integer), let's call it "superpower", but only very few users will have it. The users table contains 10K+ records, while fewer than 10 users will have a superpower (for all others it will be null).
So my question is which of the following options is more correct and better in terms of performance:
add another column to the users table called "superpower", which will be null for almost all users
have a new table called users_superpower, which will contain at most 10 records and will map users to superpowers.
Some things I have thought about:
a. the first option seems wasteful of space, but it is really just an integer...
b. the second option will require a left join every time I query the users...
c. will the answer change if "superpower" were 5 columns, for example?
Note: I'm using Hibernate and MySQL, if that changes the answer.
This might be a matter of opinion. My viewpoint on this follows:
If superpower is an attribute of users and you are not in the habit of adding attributes, then you should add it as a column. 10,000*4 additional bytes is not very much overhead.
If superpower is just one attribute and you might add others, then I would suggest using JSON or another EAV table to store the value.
If superpower is really a new type of user with other attributes and dates and so on, then create another table. In this table, the primary key can be the user_id, making the joins between the tables even more efficient.
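As a rough sketch of the two options being weighed, using the names from the question and assuming the users table has an integer primary key called id:
-- Option 1: a nullable column on the existing table (NULL for almost everyone)
ALTER TABLE users ADD COLUMN superpower INT NULL;

-- Option 2: a separate table, keyed by the user id so the LEFT JOIN stays cheap
CREATE TABLE users_superpower (
    user_id INT PRIMARY KEY,
    superpower INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (id)
) ENGINE=InnoDB;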
I would go with just adding a new boolean field in your user entity which keeps track of whether or not that user has superpowers.
Appreciate that adding a new table and linking it requires the creation of a foreign key in your current users table, and this key will be another column taking up space. So it doesn't really avoid the storage cost. If you just want a really small column to store whether a user has superpowers, you can use a boolean variable, which would map to a MySQL BIT(1) column. Because this is a fixed-width column, NULL values would still take up a single bit of space, but that is most likely not a big storage concern compared to the rest of your table.
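A minimal sketch of that flag, assuming a column name of has_superpower:
-- Fixed-width one-bit flag, defaulting to 0 (no superpowers)
ALTER TABLE users ADD COLUMN has_superpower BIT(1) NOT NULL DEFAULT 0;

-- Typical lookup
SELECT username FROM users WHERE has_superpower = 1;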

ERD design: pros and cons

I have one question regarding database design.
Here is the first example:
A user may have multiple Websites, and the user can request a specific resource for each of his websites. All requests are saved in the RequestForResource table.
Now, if I want to see the name of the user who requested a resource, I have to join the RequestForResource, Website, and User tables.
To avoid this, I can make a foreign key between the RequestForResource and User tables, as demonstrated here:
Now, in order to get a user name, I only have to join the RequestForResource and User tables, which is probably easier for the SQL server, but on the other hand I have one more foreign key.
Which approach is better and/or faster, and why?
You can always duplicate information to gain execution speed. This is called denormalisation. Yes, it will probably speed up the queries by lowering the required number of index seeks.
BUT
You have to write your code to make sure that the data stays consistent:
With the second design it is possible to insert a Website.User_idUser and a RequestForResource.User_idUser with different IDs for the same site! According to the design this is valid (but it probably will not satisfy your business rules).
Consider updating the foreign key constraint (or adding a second one) so that it references the Website table on (User_idUser, Website_idWebsite), and removing the User-RequestForResource one.
Also consider building a view to query your data with all the required info (probably with a clustered index).
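A hedged sketch of that suggestion, assuming the column names from the diagrams (Website_idWebsite, User_idUser) and that Website carries a (preferably unique) index over that pair, which MySQL needs on the referenced side; the view and its column list are likewise assumptions:
-- Composite FK: the (website, user) pair on a request must match an existing Website row
ALTER TABLE RequestForResource
    ADD CONSTRAINT fk_request_website_user
        FOREIGN KEY (Website_idWebsite, User_idUser)
        REFERENCES Website (Website_idWebsite, User_idUser);

-- A view that already joins in the user, so queries don't have to repeat the joins
CREATE VIEW RequestForResourceWithUser AS
SELECT r.*, u.Name AS UserName
FROM RequestForResource r
JOIN Website w ON w.Website_idWebsite = r.Website_idWebsite
              AND w.User_idUser       = r.User_idUser
JOIN User u    ON u.idUser = w.User_idUser;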

Mysql deduce foreign key relationship for random queries

I am a MySQL novice and am looking for a solution to the following problem:
I would like to create a CMS with cppcms which shall be capable of having modules. Since I want to reduce the chance of (accidental) access to private data, I want a module which handles data access and rights. Since this module is supposed to be unaware of the data structures created by other modules, I would like it to deduce the data owner through foreign key relations. My idea would be to search for a path (over foreign keys) which links a row to a user id.
To sum up:
What I am trying to do:
Take a random query and determine the affected rows
for the affected rows, determine a relationship/path (via foreign keys) to a user/userid (a column in an existing table)
return only the rows for which a relationship could be determined and a condition holds (e.g. the userid found in the related query matches a fixed user id, such as the user currently accessing the system)
(As far as I know, foreign keys only enforce the existence of a key in another table; however, the precondition I assume is that every row is linked to a user over a path of foreign key relations.)
My problem/question:
Is there an existing solution/better approach to the problem? Prepared statements won't do the trick since I don't know all data structures/queries in advance.
How do I get the foreign key relations? Is there another way besides "SHOW CREATE TABLE" and then parsing the result string?
How can I determine the rows that would be affected, without modifying them? I would like to filter this set afterwards by determining whether I can link it to the current user (not the MySQL user but the system user).
Could I try executing the query, then select the affected rows, and if I determine an access violation simply do a rollback? The problem with this: how do I apply the changes only to the subset of rows for which it is legal (e.g. I attempt to change 5 rows but may only change 2; how do I change only those 2)? One idea was to find a way to create a temporary table with the result set; this solution has several drawbacks: foreign key relations are not possible for temporary tables, they are 'lost'.
P.S.: I am coding in C++, therefore I would prefer C++-compatible library recommendations, however I am open to other suggestions. While googling I stumbled over Doctrine and I am currently researching it.
P.P.S.: The database engine is InnoDB (it has to be, because of the foreign keys).
UPDATE: attempt to explain part 2:
I am trying to filter which columns of which tables a user is allowed to see. To do so I would like to find a connection in the database over foreign keys (through foreign keys I ensure that I can get to all data via joins, and they are a hint on which columns I have to join on). Since I am planning a more complex system (e.g. a forum), I don't want to join all the data into a temporary table and run the user query on that. I would rather evaluate the user query and check, for the result, whether I can map it with a join to the user's id. For example, I could use this to enforce that an edit button is only enabled for the posts created by the user. (I know there are easier ways to do this, but I basically want to allow programmers to write their own queries without giving them the chance to edit or view data that they are not allowed to see. My assumption is that the programmer is not an evildoer but simply forgets constraints, thus I want to enforce them in software.)
Getting this far would be pretty good, but I have a slightly more complex need.
First a basic example. Let's say it is like Facebook and all the friends of a person are allowed to see his pictures.
pictures = id **userid** file (bool)visibleForFriends album
friendship = **userid1** **userid2**
users = userid
What I want to happen is:
Programmer inputs "SELECT * FROM pictures WHERE album=2"
System gets all matching records (e.g. a set of ids)
System sees the foreign key userid, tries to match the current userid against the picture's userid, and adds all matching rows to the returned result
System notices the special column visibleForFriends
System tries to determine all friends, i.e. userid1 from friendship where userid2 = currentUserID combined with userid2 where userid1 = currentUserID (I still have to read up on joins; see the sketch after this list)
System adds all rows where visibleForFriends is true and pictures.userid matches a result from step 5.
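A minimal sketch of the friend lookup from step 5, assuming the friendship table above; a UNION covers both directions of the pair, and @currentUserID stands in for the application-supplied user id:
-- All friends of the current user, whichever column they appear in
SELECT userid1 AS friend_id FROM friendship WHERE userid2 = @currentUserID
UNION
SELECT userid2 AS friend_id FROM friendship WHERE userid1 = @currentUserID;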
While the friendship part is some extra code (I think it is doable once I get started on the first bit), I still need to figure out how to automatically follow the foreign keys to see the connection. Ignoring the friendship special case, I would like the system to work on this as well:
pictures = id **albumid** file (bool)visibleForFriends album
albums = id **userid**
users = userid
Now the system should go pictures.albumid ==> albums.id ==> albums.userid ==> users.userid.
I hope the examples clarify the question a bit. One problem is that in point one of the example (the programmer's query input) I don't want to let "DELETE *" take effect on anything not owned by the user. So I have to filter which rows to actually delete.
In response to part 1 of your question: provided the MySQL user you access the database with has rights on information_schema, you can use the following query to list the existing foreign key relations within a specific database:
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    REFERENCED_TABLE_NAME,
    REFERENCED_COLUMN_NAME
FROM
    information_schema.KEY_COLUMN_USAGE
WHERE
    TABLE_SCHEMA = 'dbname'
    AND REFERENCED_COLUMN_NAME IS NOT NULL;
I am slightly confused by part 2 and am unsure how to give an appropriate response to that section. I hope you find the above query helpful in your project, though!
Is there an existing solution/Better approach to the problem?
Yes, I think so. You're describing a multi-tenant database. In a multi-tenant database in which the users share tables (also known as "shared everything"), each table should have a column for the user id. In effect, each row knows its owner.
This will vastly simplify your SQL, since you need no joins to determine who a row belongs to. It will probably speed up your SQL a lot, too.
This SO answer has a decent summary of the issues and alternatives.
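A hedged sketch of what that looks like for the pictures example above; the owner column (userid) and its foreign key are the point, the rest of the layout is assumed:
CREATE TABLE pictures (
    id INT AUTO_INCREMENT PRIMARY KEY,
    userid INT NOT NULL,                       -- every row knows its owner
    file VARCHAR(255),
    visibleForFriends BOOL,
    album INT,
    FOREIGN KEY (userid) REFERENCES users (userid)
) ENGINE=InnoDB;

-- The access layer only has to append one predicate, no join required
SELECT * FROM pictures WHERE album = 2 AND userid = @currentUserID;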

Always create unique keys whenever possible?

Should you always create unique keys whenever possible?
For example, let's say I have a table with three fields (student ID, first name, last name), and the student ID is the primary key.
If no two students have the same first and last name, should I create a unique key for those two fields?
Yes, you should use unique indexes, even when you already have a primary key, whenever a column or combination of columns is unique. It's good to have constraints in your database to prevent bad data. However, that is not what you have in your case. Even if you currently have no students with duplicate names, that can easily happen in the future. Names are not unique in the world.
U.S. Social Security numbers are almost always unique (they can be reused after a number of years, but that is unlikely to ever matter in your case), so they might make a good candidate for a unique index. If you have non-U.S. students, though, then you would need to make the column nullable.
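A minimal sketch of such a constraint, assuming a students table and an ssn column (MySQL unique indexes allow multiple NULLs, so students without a number can simply leave it empty):
ALTER TABLE students
    ADD COLUMN ssn CHAR(11) NULL,              -- nullable for students without one
    ADD UNIQUE KEY uq_students_ssn (ssn);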
Yes, usually having unique IDs (surrogate keys) is best. In this case, last name and first name are not enough for a primary key. Even if you have no duplicate names now, you can't be sure you won't have two John Smiths in the future.
Don't make the assumption that no two students will have the same name.
When the underlying model suggests it, it is a good idea to create unique keys. Constraints like these will ensure cohesive data and prevent errors. But in your case the underlying model does not suggest this to be the case.
Unique keys should follow business definitions; if the studentID is a "semi-natural" key (it has unique meaning that exists beyond your specific database), then that should suffice as your unique key.
If the studentID is simply an identity value that is assigned by the database as a row-number, then you probably need some other unique key to avoid entering the same student twice.
A primitive primary key with no relation to the data domain is one of the widely accepted best practices
(just imagine: one of your students decides to marry).
Another good practice (though from the NoSQL world) is to use a GUID - this way keys are unique, and different datasets can be mixed in the same table without collisions.
PS: you could save some storage space otherwise, but today storage is cheap and there is no need to sacrifice good practices for it.
Yes!
If you ever need to update or delete rows from the table, it is very advantageous to have something to uniquely identify each row in the table.
With your example, I don't think it's possible to guarantee that no two students will ever share the same name. Even adding a date of birth still can't guarantee they'll always be unique. I'd recommend adding an auto-incrementing INT or BIGINT as the primary key.
You can always add the unique constraint as well and remove it if it becomes an issue.
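A minimal sketch of that recommendation; the table and column names are illustrative:
CREATE TABLE students (
    student_id BIGINT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, never derived from the name
    first_name VARCHAR(100) NOT NULL,
    last_name  VARCHAR(100) NOT NULL
) ENGINE=InnoDB;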
A simple way to do it is to use an auto-generated GUID (Globally Unique Identifier) to identify a student. It is "guaranteed" to be unique every time it is generated. Names can change (like when somebody gets married), but an auto-generated value has no meaning, so it should never need to change.
http://en.wikipedia.org/wiki/Globally_unique_identifier
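In MySQL this could look like the sketch below, using the built-in UUID() function; storing the value in a CHAR(36) primary key column (here hypothetically named student_guid) is one common approach:
-- Assuming students has a CHAR(36) primary key column named student_guid
INSERT INTO students (student_guid, first_name, last_name)
VALUES (UUID(), 'John', 'Smith');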
Your database constraints should be business rules understood by the DBMS. Is there a business rule that states that no two students may have the same first and last name combination? I presume not, and therefore you should not create a unique key for those two fields. Perhaps it is best not to presume, though, and to ask a business domain expert, e.g. the enrolment officer.
Note that a row in this table is a proposition, i.e. that there exists a student enrolled with first name 'x', last name 'y', and student ID 'z'. Clearly the DBMS has no concept of whether this proposition is true in the real world. What normally happens is that there will be a trusted source to verify data. The enterprise will authorize an officer (director etc.) in this role. Let's say it is the enrolment officer who is responsible for verifying that 'x y' is a real person, that they are eligible to be enrolled, and that the person is who they say they are. Typically, they will require sight of documents (certificates, passport, etc.), take up references, interview the person, check public records, etc. Of course, the enrolment officer may delegate their responsibility to other members of staff or engage an agent.
At some point they will be satisfied and, for convenience, will issue their own identifier, the student ID. Mistakes do happen and it may turn out that this value is not unique, in which case it would be the enrolment officer's responsibility to resolve the problem and issue a new student ID. Perhaps they will use software to generate the value to mitigate against such problems. The student ID will be issued to the student and will be used within the enterprise to identify the person for the convenience of all concerned. The student may even be issued with a document (e.g. a photo ID card) to assist in identification, based on the level of trust in a given context (e.g. they may need to produce photo ID to sit an exam). If the student forgets their ID, loses their issued documents, etc., then the enrolment office will be able to retrieve it from records, e.g. with reference to copy documents taken during the verification process; they are unlikely to use first name and last name alone.
The point is, the trusted source for the identifier is the enrolment officer on behalf of the enterprise, rather than the database, the DBMS or any other kind of software involved in the process. Therefore, it probably is acceptable to make student ID the sole identifier for students within the database. Consider, however, that an auto-increment column generated on one hardware build of a single DBMS within the enterprise is probably not suitable for the allocation of such significant identifier values.

Trying to avoid "Polymorphic Associations" and uphold foreign key referential integrity

I'm building a site similar to Yelp (Recommendation Engine, on a smaller scale though), so there will be three main entities in the system: User, Place (includes businesses), and Event.
Now what I'm wondering about is how to store information such as photos, comments, and 'compliments' (similar to Facebook's "Like") for each of these types of entity, and also for each object they can be applied to (e.g. a comment on a recommendation, a photo, etc.). Right now the way I was doing it was a single table for each, i.e.
Photo (id, type, owner_id, is_main, etc...)
where type represents: 1=user, 2=place, 3=event
Comment (id, object_type, object_id, user_id, content, etc, etc...)
where object_type can be a few different objects like photos, recommendations, etc
Compliment (object_id, object_type, compliment_type, user_id)
where object_type can be a few different objects like photos, recommendations, etc
Activity (id, source, source_type, source_id, etc..) //for "activity feed"
where source_type is a user, place, or event
Notification (id, recipient, sender, activity_type, object_type, object_id, etc...)
where object_type & object_id will be used to provide a direct link to the object of the notification e.g. a user's photo that was complimented
But after reading a few posts on SO, I realized I can't maintain referential integrity with a foreign key, since that requires a 1:1 relationship and my source_id/object_id fields can relate to an ID in more than one table. So I decided to go with the method of keeping the main entity but breaking it into subsets, i.e.
User_Photo (photo_id, user_id) | Place_Photo(photo_id, place_id) | etc...
Photo_Comment (comment_id, photo_id) | Recommendation_Comment(comment_id, rec_id) | etc...
Compliment (id, ...) //would need to add a surrogate key to Compliment table now
Photo_Compliment(compliment_id, photo_id) | Comment_Compliment(compliment_id, comment_id) | etc...
User_Activity(activity_id, user_id) | Place_Activity(activity_id, place_id) | etc...
I was thinking I could just create views joining each sub-table to the main table to get the results I want. Plus I'm thinking it would fit into my object models in CodeIgniter as well.
The only table I think I could leave alone is the notifications table, since there are many object types (forum post, photo, recommendation, etc.), and this table will only hold notifications for a week anyway, so any referential integrity issues shouldn't be much of a problem (I think).
So am I going about this in a sensible way? Any performance, reliability, or other issues that I may have overlooked?
The only "problem" I can see is that I would end up with a lot of tables (as it is right now I have about 72, so I guess i would end up with a little under 90 tables after I add the extras), and that's not an issue as far as I can tell.
Really grateful for any kind of feedback. Thanks in advance.
EDIT: Just to be clear, I'm not concerned if I end up with another 10 or so tables. From what I know, the number of tables isn't too much of an issue (as long as they're being used)... unless you had, say, 200 or so :/
Some propositions for this UoD (universe of discourse)
User named Bob logged in.
User named Bob uploaded photo number 56.
There is a place named London.
Photo number 56 is of place named London.
User named Joe created comment "very nice" on photo number 56.
To introduce object IDs
User (UserID) logged in.
User (UserID) uploaded Photo (PhotoID).
There is Place (PlaceID).
Photo (PhotoID) is of Place (PlaceID).
User (UserID) created Comment (CommentID) on Photo (PhotoID).
Just Fact Types
User logged in.
User uploaded Photo.
Place exists.
Photo is of Place.
User created Comment on Photo.
Now to extract predicates
Predicate               Predicate Arity
---------------------------------------------
... logged in           1 (Unary predicate)
... uploaded ...        2 (Binary)
... exists              1 (Unary)
... is of ...           2 (Binary)
... created ... on ...  3 (Ternary)
It looks like each proposition in this UoD may be stated with at most a ternary predicate,
so I would suggest something like the model below.
A predicate role (Role_1_ID, Role_2_ID, Role_3_ID) is the part that an object plays in a predicate. Substitute the ... in a predicate from left to right with each Role_ID.
Note that only Role_1_ID is mandatory (a predicate is at least unary); the other two may be NULL.
In this simple model, it is possible to propose anything.
Hence, you would need to implement constraints on the application layer.
For example, you have to make sure that it is possible to create a Comment on a Place, but not a Place on a Place.
Not all predicates represent actions; for example, ... logged in is an action while ... is of ... is not.
So, your activity feed would list all Propositions with Predicate.IsAction = True.
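One possible reading of this model as MySQL tables, purely as an illustrative sketch; apart from Role_1_ID/Role_2_ID/Role_3_ID and IsAction, the table and column names are assumptions:
CREATE TABLE Predicate (
    PredicateID INT AUTO_INCREMENT PRIMARY KEY,
    Reading     VARCHAR(255),        -- e.g. '... created ... on ...'
    Arity       TINYINT NOT NULL,    -- 1, 2 or 3
    IsAction    BOOL NOT NULL        -- drives the activity feed
) ENGINE=InnoDB;

CREATE TABLE Proposition (
    PropositionID INT AUTO_INCREMENT PRIMARY KEY,
    PredicateID   INT NOT NULL,
    Role_1_ID     INT NOT NULL,      -- always present (unary at minimum)
    Role_2_ID     INT NULL,
    Role_3_ID     INT NULL,
    FOREIGN KEY (PredicateID) REFERENCES Predicate (PredicateID)
) ENGINE=InnoDB;

-- Activity feed, as described above
SELECT p.*
FROM Proposition p
JOIN Predicate pr ON pr.PredicateID = p.PredicateID
WHERE pr.IsAction = TRUE;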
If you rearrange things slightly, you can simplify your comments and compliments. Essentially you want to have a single store of comments and another one of compliments. Your problem is that this won't let you use declarative referential integrity (foreign key constraints).
The way to solve this is to make sure that the objects that can attract comments and compliments are all logical sub-types of one supertype. From a logical perspective, it means you have a "THING_OF_INTEREST" entity (I'm not making a naming convention recommendation here!) and each of the various specific things which attract comments and compliments will be a sub-type of THING_OF_INTEREST. Therefore your comments table will have a "thing_of_interest_id" FK column, and similarly for your compliments table. You will still have the sub-type tables, but they will have a 1:1 FK with THING_OF_INTEREST. In other words, THING_OF_INTEREST does the job of giving you a single primary key domain, whereas the sub-type tables contain the type-specific attributes. In this way, you can still use declarative referential integrity to enforce your comment and compliment relationships without having to have separate tables for different types of comments and compliments.
From a physical implementation perspective, the most important thing is that your various things of interest all share a common primary key domain. That's what lets your comment table have a single FK value that can be easily joined with whatever that thing of interest happens to be.
Depending on how you go after your comments and recommendations, you probably will (but may not) need to physically implement THING_OF_INTEREST - which will have at least two attributes, the primary key (usually an int) plus a partitioning attribute that tells you which sub-type of thing it is.
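A hedged sketch of that supertype/sub-type layout in MySQL; the names follow the answer where it gives them (THING_OF_INTEREST, thing_of_interest_id) and are otherwise assumptions:
CREATE TABLE thing_of_interest (
    thing_of_interest_id INT AUTO_INCREMENT PRIMARY KEY,
    thing_type ENUM('photo', 'recommendation', 'event') NOT NULL   -- partitioning attribute
) ENGINE=InnoDB;

-- A sub-type table shares the supertype's key (1:1)
CREATE TABLE photo (
    thing_of_interest_id INT PRIMARY KEY,
    owner_id INT,
    path VARCHAR(255),
    FOREIGN KEY (thing_of_interest_id) REFERENCES thing_of_interest (thing_of_interest_id)
) ENGINE=InnoDB;

-- Comments point at the supertype, so one FK covers every commentable thing
CREATE TABLE comment (
    comment_id INT AUTO_INCREMENT PRIMARY KEY,
    thing_of_interest_id INT NOT NULL,
    user_id INT NOT NULL,
    content TEXT,
    FOREIGN KEY (thing_of_interest_id) REFERENCES thing_of_interest (thing_of_interest_id)
) ENGINE=InnoDB;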
If you need referential integrity (RI), there is no better way to get it than many-to-many junction tables. True, you end up having a lot of tables in the system, but that's the cost you need to pay. Going this route also has some other benefits; for instance, you get a sort of partitioning for free: the data is partitioned by relation type, each in its own table. This offers RI but it is not 100% safe either; for instance, there's nothing to guarantee that a comment belongs to a photo and to that photo alone, so you'd need to enforce that kind of constraint manually should you need it.
On the other hand, going with a generic solution like you already did gets you off the ground faster and is way easier to extend in the future, but there'll be no RI unless you code it manually (which is very complex and a lot harder to deal with than the alternative of an M:M table for every relation type).
Just to mention another alternative, similar to your existing implementation: you could use a single custom M:M junction table to handle all your relations regardless of their type: object1_type, object1_id, object2_type, object2_id. Simple, but with no other benefit besides being very easy to implement and extend. I'd only recommend it if you don't need RI and you've got yourself a lot of tables, all interlinked.