How to manage entity duplication in database table [closed] - mysql

I am working on a simple database design for an application.
I have Book, Author, Illustrator and Editor tables.
Model 1: separate Author, Illustrator and Editor tables, each related to Book
With this model, I think the name column is duplicated in each of the author, editor and illustrator tables.
If a book's author, illustrator and editor are the same person, the data gets duplicated across 3 tables.
But searching will be faster, I guess, as the number of rows per table will be smaller.
Model 2
With this model, all the author, illustrator and editor info is saved in a single table, and I am not sure what this table should be called.
With this approach the data won't get duplicated, but I guess searching will take about twice as long as in model 1.
Can anyone suggest which model I should choose? I feel model 2 is better.

Which model you use is largely a matter of taste. The second one has the advantage that you won't get duplicates. With both models you can get the results with one query:
select * from books
left join names auth ON (auth.id = author_id)
left join names ill ON (ill.id = illustrator_id)
left join names ed ON (ed.id = editor_id)
where books.id = 1;
SQLFiddle gives an example of model 2. If you want to obtain the data from model 1, just point each of the 3 joins at the corresponding table.
If you want to display a list of authors, I would not recommend adding a new field to the names table; just use a join query.
select auth.* from books
left join names auth ON (auth.id = author_id)
As long as you set the indexes on the id, author_id, illustrator_id and editor_id, you are fine.
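To make the model 2 query concrete, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The column list of `books` and `names`, the index names and the sample rows are assumptions for illustration; only the join structure comes from the query above, and MySQL syntax would differ slightly.

```python
import sqlite3

# Hypothetical model 2 schema: one shared `names` table, three foreign keys on `books`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    author_id INTEGER REFERENCES names(id),
    illustrator_id INTEGER REFERENCES names(id),
    editor_id INTEGER REFERENCES names(id)
);
-- Indexes on the three foreign keys, as the answer recommends.
CREATE INDEX idx_books_author ON books(author_id);
CREATE INDEX idx_books_illustrator ON books(illustrator_id);
CREATE INDEX idx_books_editor ON books(editor_id);
INSERT INTO names VALUES (1, 'Alice'), (2, 'Bob');
-- Alice is both author and illustrator, yet she is stored only once.
INSERT INTO books VALUES (1, 'Sample Book', 1, 1, 2);
""")

def book_credits(book_id):
    """Return (title, author, illustrator, editor) for one book."""
    return conn.execute("""
        SELECT b.title, auth.name, ill.name, ed.name
        FROM books b
        LEFT JOIN names auth ON auth.id = b.author_id
        LEFT JOIN names ill ON ill.id = b.illustrator_id
        LEFT JOIN names ed ON ed.id = b.editor_id
        WHERE b.id = ?
    """, (book_id,)).fetchone()

credits = book_credits(1)
print(credits)
```

The same person appears in two roles without a second row in `names`, which is the duplication that model 1 cannot avoid.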
Edit: my preference would go to model 2. I think it might also be a bit faster:
The database only needs to open one file (not 3)
There are fewer records in the table (compared to the 3 tables combined) because you don't have duplicates.
The database only needs to search through one index set (not 3) and might do some optimisation behind the scenes because it is looking for 3 keys in the same set (instead of 3 keys in 3 index sets). This is my gut feeling; I am not sure it is exactly correct.

You can amend the second design you proposed by adding a user type column, which describes whether the person is an author, an illustrator, an editor, or any combination of these. The value will range from 0 to 7: you store the decimal value of a 3-bit mask. For example, if a person is both Editor and Author, then
1(Editor) 0(Illustrator) 1(Author) => 5
Then, when you perform a select or search on that table, you can filter on the user type in the WHERE clause.
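A minimal sketch of this bitmask idea, assuming a hypothetical `people` table with a `user_type` column and the bit assignment from the example (Author = 1, Illustrator = 2, Editor = 4). It runs on SQLite from Python; the bitwise `&` filter works the same way in MySQL.

```python
import sqlite3
from enum import IntFlag

class Role(IntFlag):
    # Bit positions follow the example: Editor is the high bit, Author the low bit.
    AUTHOR = 1
    ILLUSTRATOR = 2
    EDITOR = 4

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, user_type INTEGER)"
)
conn.executemany(
    "INSERT INTO people (name, user_type) VALUES (?, ?)",
    [
        ("Alice", int(Role.EDITOR | Role.AUTHOR)),  # stored as 5
        ("Bob", int(Role.ILLUSTRATOR)),             # stored as 2
        ("Carol", int(Role.EDITOR)),                # stored as 4
    ],
)

def people_with_role(role):
    """Return the names of everyone whose mask has the given role bit set."""
    rows = conn.execute(
        "SELECT name FROM people WHERE (user_type & ?) != 0 ORDER BY name",
        (int(role),),
    ).fetchall()
    return [name for (name,) in rows]

editors = people_with_role(Role.EDITOR)
print(editors)
```

One trade-off to keep in mind: a bitwise filter like this generally cannot use an ordinary index on `user_type`, so it suits small tables better than large ones.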

Do you need to validate, for example, that the author is defined as author in "Author" before you link to a book as author?
Do you care to do a query to know who are all authors/editors/illustrators defined in your database?
You have drawn an N-N link between the entities, yet you have "authorId", "editorId" and "illustratorId" in the "Book" entity!
The proper way is to resolve the many-to-many relationship with another table, ending up with something like this:
1. BOOK, has ID, TITLE, DESC, etc.
2. PARTICIPANT (suggested name for all people), has ID, NAME, BIO, etc.
3. AUTHOR, has BOOK_ID, PARTICIPANT_ID
4. EDITOR, has BOOK_ID, PARTICIPANT_ID
5. ILLUSTRATOR, has BOOK_ID, PARTICIPANT_ID
OR, instead of (3, 4, 5), BOOK_PARTICIPANT, has BOOK_ID, PARTICIPANT_ID, PARTICIPATION_TYPE (code for author, editor, illustrator), or even use flags (IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR, where at least one is required to be set)
If you need to validate the participant as author, editor or illustrator before being able to link to a book, you need to add three flags to PARTICIPANT: IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR
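The single junction table variant (BOOK_PARTICIPANT with a participation type) could be sketched as follows. The lower-case table names, the text codes for the participation type, the composite primary key and the sample rows are assumptions for illustration, and the sketch uses SQLite from Python rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE participant (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One row per (book, person, role): resolves the many-to-many relationship.
CREATE TABLE book_participant (
    book_id INTEGER NOT NULL REFERENCES book(id),
    participant_id INTEGER NOT NULL REFERENCES participant(id),
    participation_type TEXT NOT NULL
        CHECK (participation_type IN ('author', 'editor', 'illustrator')),
    PRIMARY KEY (book_id, participant_id, participation_type)
);
INSERT INTO book VALUES (1, 'Sample Book');
INSERT INTO participant VALUES (1, 'Alice'), (2, 'Bob');
-- Two authors for one book; Alice is also the illustrator.
INSERT INTO book_participant VALUES
    (1, 1, 'author'), (1, 2, 'author'), (1, 1, 'illustrator');
""")

def participants(book_id, role):
    """Return the names of everyone linked to the book in the given role."""
    rows = conn.execute("""
        SELECT p.name
        FROM book_participant bp
        JOIN participant p ON p.id = bp.participant_id
        WHERE bp.book_id = ? AND bp.participation_type = ?
        ORDER BY p.name
    """, (book_id, role)).fetchall()
    return [name for (name,) in rows]

authors = participants(1, "author")
illustrators = participants(1, "illustrator")
print(authors, illustrators)
```

Unlike fixed `author_id` / `editor_id` columns on the book, this layout allows any number of people per role.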
Hope this helps

Related

Split similar data into two tables?

I have two sets of data that are near identical, one set for books, the other for movies.
So we have things such as:
Title
Price
Image
Release Date
Published
etc.
The only difference between the two sets of data is that Books have an ISBN field and Movies has a Budget field.
My question is, even though the data is similar should both be combined into one table or should they be two separate tables?
I've looked on SO at similar questions but am asking because most of the time my application will need to get a single list of both books and movies. It would be rare to get either books or movies. So I would need to lookup two tables for most queries if the data is split into two tables.
Doing this -- cataloging books and movies -- perfectly is the work of several lifetimes. Don't strive for perfection, because you'll likely never get there. Take a look at Worldcat.org for excellent cataloging examples. Just two:
https://www.worldcat.org/title/coco/oclc/1149151811
https://www.worldcat.org/title/designing-data-intensive-applications-the-big-ideas-behind-reliable-scalable-and-maintainable-systems/oclc/1042165662
My suggestion: Add a table called metadata. Your titles table should have a one-to-many relationship with your metadata table.
Then, for example, titles might contain
title_id  title                                  price  release
103       Designing Data-Intensive Applications  34.96  2017
104       Coco                                   34.12  2017
Then metadata might contain
metadata_id  title_id  key             value
1            103       ISBN-13         978-1449373320
2            103       ISBN-10         1449373320
3            104       budget          USD175000000
4            104       EIDR            10.5240/EB14-C407-C74B-C870-B5B6-C
5            104       Sound Designer  Barney Jones
Then, if you want to get items with their ISBN-13 values (I'm not familiar with IBAN, but I guess that's the same sort of thing) you do this. Note that the join conditions must reference the alias isbn13, and that key is a reserved word in MySQL, so it needs backticks:
SELECT titles.*, isbn13.value AS isbn13
FROM titles
LEFT JOIN metadata isbn13 ON titles.title_id = isbn13.title_id
AND isbn13.`key` = 'ISBN-13'
This is a good way to go because it's future-proof. If somebody turns up tomorrow and wants, let's say, the name of the most important character in the book or movie, you can add it easily.
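Here is a runnable sketch of the titles/metadata layout, using SQLite from Python. The columns are named `meta_key`, `meta_value` and `release_year` here, an assumption made purely to sidestep reserved words; the rows are a subset of the sample data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (
    title_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    price REAL,
    release_year INTEGER
);
-- One-to-many: each title can carry any number of key/value attributes.
CREATE TABLE metadata (
    metadata_id INTEGER PRIMARY KEY,
    title_id INTEGER NOT NULL REFERENCES titles(title_id),
    meta_key TEXT NOT NULL,
    meta_value TEXT
);
INSERT INTO titles VALUES
    (103, 'Designing Data-Intensive Applications', 34.96, 2017),
    (104, 'Coco', 34.12, 2017);
INSERT INTO metadata VALUES
    (1, 103, 'ISBN-13', '978-1449373320'),
    (2, 103, 'ISBN-10', '1449373320'),
    (3, 104, 'budget', 'USD175000000');
""")

# LEFT JOIN keeps every title; titles without an ISBN-13 get NULL (None in Python).
rows = conn.execute("""
    SELECT t.title, isbn13.meta_value
    FROM titles t
    LEFT JOIN metadata isbn13
        ON isbn13.title_id = t.title_id AND isbn13.meta_key = 'ISBN-13'
    ORDER BY t.title_id
""").fetchall()
print(rows)
```

Putting the key filter in the `ON` clause rather than in `WHERE` is what keeps the movie in the result: a `WHERE` filter would drop rows whose metadata side is NULL.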
The only difference between the two sets of data is that Books have an
IBAN field and Movies has a Budget field.
Are you sure that this difference that you have now will not be
extended to other differences that you may have to take into account
in the future?
Are you sure that you will not have to deal with any other type of
entities (other than books and movies) in the future which will
complicate things?
If the answer to both questions is "Yes" then you could use 1 table.
But if I had to design this, I would keep a separate table for each entity.
If needed, it's easy to combine their data in a View.
What is not easy, is to add or modify columns in a table, even naming them, just to match requirements of 2 or more entities.
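As a rough illustration of combining two separate tables in a view, here is a sketch on SQLite from Python. The table layouts, the view name `catalog` and the `kind` discriminator column are assumptions, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each entity keeps its own table and its own specific column.
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, price REAL, isbn TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, price REAL, budget INTEGER);

-- The view exposes only the shared columns, plus a tag naming the source table.
CREATE VIEW catalog AS
    SELECT 'book' AS kind, id, title, price FROM books
    UNION ALL
    SELECT 'movie' AS kind, id, title, price FROM movies;

INSERT INTO books VALUES (1, 'Designing Data-Intensive Applications', 34.96, '978-1449373320');
INSERT INTO movies VALUES (1, 'Coco', 34.12, 175000000);
""")

# A single query returns the mixed list the application usually needs.
catalog_rows = conn.execute(
    "SELECT kind, title FROM catalog ORDER BY title"
).fetchall()
print(catalog_rows)
```

Because book and movie ids come from different tables, the pair `(kind, id)` identifies a row in the view, not `id` alone.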
You must be very sure about the future requests and features for your application.
I can't imagine what kind of books linked with movies you store, since a lot of movies have different titles from the books they are based on. Example: 25 films that changed the name.
If you are sure that your data will always stay the same for books and movies, you can create a new table, for example Productions, and store the attributes Title, Price, Image, Release Date and Published there. Then you can store foreign keys to the Production entity in your Books and Movies tables.
But if anything changes in the future, you will need to rebuild the structure or change your assumptions. Even so, it will be easier with the Production entity: you just create a new row with the modified values and assign it to the selected Book or Movie.
A single table for both books and movies is the worst solution, because if one of the values diverges you will have to add a new row, and you will end up with one row for the first set (real book and non-existing movie) and one for the second set (non-existing book and real movie).
Of course, all of this assumes there may be changes in the future. If you are 100% sure there won't be, then one table is sufficient, though not correct from a database normalization perspective.
I would personally create separate tables for books and movies.

How to design erd where employee is a customer too? [closed]

I have a scenario of a movie theatre and in this scenario an employee can be a customer too (because they can buy tickets).
I created two entities, customer and employee. If I make emp_id of employee a foreign key in customer then it makes an employee a customer too:
Customer: cust_id, Name, age, emp_id
Employee: emp_id, Name, age
But when I do this, data of employee gets repeated in customer. What should I do??
First thing when I look at your tables, you shouldn't really store age. Store birthdate or birth year, so that you won't have to update it every year, because age is relative to current year.
Please read on, or skip to TL;DR part if you want my opinion about this.
For the issue you are having there can be multiple design choices and I find it primarily opinion based on which one to choose from.
One option would be to create one table for storing person-related data and include type of the person in this table, so that you will either have customers or employees. This way, your employee can also be treated as a customer, but you know that it's a special one. This way you can convert record of an employee to a customer when he/she is no longer an employee.
Another way to approach this is to treat them as you already have it and deal with the fact that you are repeating data. Unless you have a large amount of employees (this should not be the case for movie theatre) this is a valid approach as well, since obviously they can be customers, but I assume you would like to discount them or for some reason distinguish between those two types of clients.
Would you like to keep it the way it currently is, but avoid repeating the data so as not to make mistakes? Alter the columns that are shared by both tables and make them nullable. Use a trigger or some other rule mechanism to check whether emp_id is filled and, if so, keep all the other common column values null. With this approach you need to pull the shared data from the other table, so a LEFT JOIN is required to get the data about customers who are also employees.
And there are many more options to choose from ...
TL;DR
If you ask me, I'd most likely go with first option to store person-related data within one table and either create a type of person or have different tables for employees and customers which will be in 1:1 relation with person table.
That said, it could look like:
Person (person_id, name, birth_year)
Employee (person_id, ...) (store only employee related data here)
Customer (person_id, ...) (store only customer related data here)
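A small sketch of this Person / Employee / Customer layout, run on SQLite from Python. The role-specific columns `job_title` and `loyalty_points` and the sample people are hypothetical, added only so each table has something to store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared personal data lives in exactly one place.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    birth_year INTEGER
);
-- 1:1 with person: the primary key is also the foreign key.
CREATE TABLE employee (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    job_title TEXT
);
CREATE TABLE customer (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    loyalty_points INTEGER DEFAULT 0
);
INSERT INTO person VALUES (1, 'Dana', 1990), (2, 'Eli', 1985);
INSERT INTO employee VALUES (1, 'Usher');
-- Dana is an employee who also buys tickets; Eli is only a customer.
INSERT INTO customer VALUES (1, 10), (2, 0);
""")

# List all customers and flag the ones who are also employees.
customers = conn.execute("""
    SELECT p.name, e.person_id IS NOT NULL AS is_employee
    FROM customer c
    JOIN person p ON p.person_id = c.person_id
    LEFT JOIN employee e ON e.person_id = c.person_id
    ORDER BY p.name
""").fetchall()
print(customers)
```

Dana's name and birth year are stored once, even though she has a row in both role tables.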
As a side note, it might be a good idea to figure out how you want to differentiate people in general. You have not presented the scope of the entire system, so it's hard to give advice on that.
If there are only two roles, customer and employee, you can use one table to store the users, like this:
person(id, name, birth, is_customer, is_employee, ...)
But if you expect to add more roles in the future, you can add a role table, like this:
role(id, name, ...)
person(id, name, birth, roles, ...)
The roles column can store JSON data like [1,2] or a string like 1,2, depending on your preference and MySQL version.

Database Design: User Profiles like in Meetup.com

In Meetup.com, when you join a meetup group, you are usually required to complete a profile for that particular group. For example, if you join a movie meetup group, you may need to list the genres of movies you enjoy, etc.
I'm building a similar application, wherein users can join various groups and complete different profile details for each group. Assume the 2 possibilities:
Users can create their own groups and define what details to ask users that join that group (so, something a bit dynamic -- perhaps suggesting that at least an EAV design is required)
The developer decides now which groups to create and specify what details to ask users who join that group (meaning that the profile details will be predefined and "hard coded" into the system)
What's the best way to model such data?
More elaborate example:
The "Movie Goers" group request their members to specify the following:
Name
Birthdate (to be used to compute member's age)
Gender (must select from "male" or "female")
Favorite Genres (must select 1 or more from a list of specified genres)
The "Extreme Sports" group request their member to specify the following:
Name
Description of Activities Enjoyed (narrative form)
Postal Code
The bottom line is that each group may require different details from members joining their group. Ideally, I would like anyone to create a group (ala MeetUp.com). However, I also need the ability to query for members fairly well (e.g. find all women movie goers between the ages of 25 and 30).
For something like this....you'd want maximum normalization, so you wouldn't have duplicate data anywhere. Because your user-defined tables could possibly contain the same type of record, I think that you might have to go above 3NF for this.
My suggestion would be this - explode your tables so that you have something close to 6NF with EAV, so that each question that users must answer will have its own table. Then, your user-created tables will all reference one of your question tables. This avoids the duplication of data issue. (For instance, you don't want an entry in the "MovieGoers" group with the name "John Brown" and one in the "Extreme Sports" group with the name "Johnny B." for the same user; you also don't want his "what is your favorite color" answer to be "Blue" in one group and "Red" in another. Any data that can span across groups, like common questions, would be normalized in this form.)
The main drawback to this is that you'd end up with a lot of tables, and you'd probably want to create views for your statistical queries. However, in terms of pure data integrity, this would work well.
Note that you could probably get away with only factoring out the common fields, if you really wanted to. Examples of common fields would include Name, Location, Gender, and others; you could also do the same for common questions, like "what is your favorite color" or "do you have pets" or something to that extent. Group-specific questions that don't span across groups could be stored in a separate table for that group, un-exploded. I wouldn't advise this because it wouldn't be as flexible as the pure 6NF option and you run the risk of duplication (how do you predetermine which questions won't be common questions?) but if you really wanted to, you could do this.
There's a good question about 6NF here: Would like to Understand 6NF with an Example
I hope that made some sense and I hope it helps. If you have any questions, leave a comment.
Really, this is exactly a problem for which SQL is not a right solution. Forget normalization. This is exactly the job for NoSQL document stores. Every user as a document, having some essential fields like id, name, pwd etc. And every group adds possibility to add some fields. Unique fields can have names group-id-prefixed, shared fields (that grasp some more general concept) can have that field name free.
Besides users (and groups), you will have field descriptions with name, type, possible values, and so on, which is also a very good fit for a document store.
If you use a key-value document store from the beginning, you gain this free-form way of structuring your data, plus the ability to query it (though not with SQL, but by whatever means this or that NoSQL database provides).
First I'd like to note that the following structure is just a basis for your DB, and you will need to expand or reduce it.
There are the following entities in DB:
user (just user)
group (any group)
template (list of requirements grouped into a template to simplify assignment)
requirement (single requirement. For example: date of birth, gender, favorite sport)
"Modeling":
**User**
user_id
user_name
**Group**
name
group_id
user_group
user_id (FK)
group_id (FK)
**requirement**:
requirement_id
requirement_name
requirement_type (FK) (the type: combo, free string, date; should refer to a dictionary table)
**template**
template_id
template_name
**template_requirement**
r_id (FK)
t_id (FK)
The next step is to model an appropriate schema for storing restrictions, i.e. the validation rule for any requirement in any template. We have to separate it out because the same requirement can have different restrictions in different groups (for example: "age"). You can use the following table:
**restrictions**
group_id
template_id
requirement_id (needed alongside template_id because the same requirement can exist in different templates and any group can consist of many templates)
restriction_type (FK) (points to another dictionary: value, length, regexp, at_least_one_value_chosen and so on)
So, as I said, this is just the basis. Feel free to simplify this schema (remove tables, or drop multiple templates per group). Or you can make it more general by adding the ability to create and publish templates, requirements and so on.
Hope you find this idea useful
You could save such data as JSON or XML (Structure, Data)
User Table
Userid
Username
Password
Groups -> JSON Array of all Groups
GroupStructure Table
Groupid
Groupname
Groupstructure -> JSON Structure (with specified Fields)
GroupData Table
Userid
Groupid
Groupdata -> JSON Data
I think this covers most of your constraints:
users
user_id, user_name, password, birth_date, gender
1, Robert Jones, *****, 2011-11-11, M
group
group_id, group_name
1, Movie Goers
2, Extreme Sports
group_membership
user_id, group_id
1, 1
1, 2
group_data
group_data_id, group_id, group_data_name
1, 1, Favorite Genres
2, 2, Favorite Activities
group_data_value
id, group_data_id, group_data_value
1,1,Comedy
2,1,Sci-Fi
3,1,Documentaries
4,2,Extreme Cage Fighting
5,2,Naked Extreme Bike Riding
user_group_data
user_id, group_id, group_data_id, group_data_value_id
1,1,1,1
1,1,1,2
1,2,2,4
1,2,2,5
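The question asks for queries such as "find all women movie goers between the ages of 25 and 30". With `users` and `group_membership` laid out as above, that could look like the sketch below (SQLite from Python). The two extra users, the reference date and the helper name are assumptions; the age range is turned into a birth-date range so the database can compare plain ISO date strings.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    user_name TEXT,
    birth_date TEXT,  -- ISO 8601 text, so string comparison matches date order
    gender TEXT
);
CREATE TABLE group_membership (
    user_id INTEGER,
    group_id INTEGER,
    PRIMARY KEY (user_id, group_id)
);
INSERT INTO users VALUES
    (1, 'Robert Jones', '2011-11-11', 'M'),
    (2, 'Ann Lee', '1995-06-01', 'F'),
    (3, 'Mia Cruz', '1980-01-15', 'F');
INSERT INTO group_membership VALUES (1, 1), (1, 2), (2, 1), (3, 1);
""")

def members_by_gender_and_age(group_id, gender, min_age, max_age, today):
    """Names of group members of the given gender aged min_age..max_age on `today`."""
    # Someone aged `a` was born after (today - (a + 1) years) and on or before
    # (today - a years). Note: date.replace() raises ValueError if today is Feb 29
    # and the target year is not a leap year; a real implementation must handle that.
    born_after = today.replace(year=today.year - max_age - 1).isoformat()
    born_on_or_before = today.replace(year=today.year - min_age).isoformat()
    rows = conn.execute("""
        SELECT u.user_name
        FROM users u
        JOIN group_membership gm ON gm.user_id = u.user_id
        WHERE gm.group_id = ? AND u.gender = ?
          AND u.birth_date > ? AND u.birth_date <= ?
        ORDER BY u.user_name
    """, (group_id, gender, born_after, born_on_or_before)).fetchall()
    return [name for (name,) in rows]

matches = members_by_gender_and_age(1, "F", 25, 30, today=date(2024, 6, 15))
print(matches)
```

This works directly only because birth date and gender are real columns on `users`; attributes stored as rows in `user_group_data` would need an extra join per attribute.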
I've had similar issues to this. I'm not sure if this would be the best recommendation for your specific situation but consider this.
Provide a means of storing data as XML, or JSON, or some other format that delimits the data, but basically stores it in field that has no specific format.
Provide a way to store the definition of that data
Provide a lookup/index table for the data.
This is a combination of techniques indicated already.
Essentially, you would create some interface for your clients to create a "form" for what they want saved. This form would indicate what pieces of information they want from the user. It would also indicate what pieces of information you want to search on.
Save this information to the definition table.
The definition table is then used to describe the user interface for entering data.
Once user data is entered, save the data (as xml or whatever) to one table with a unique id. At the same time, another table will be populated as an index with
id where the xml data was saved
name of field data is stored in
value of field data stored.
id of data definition.
Now when a search commences, there should be no issue searching the index table by name, value and definition id and getting back the id of the XML/JSON (or whatever) data you stored in the data table.
That data should be transformable once it is retrieved.
I was seriously sketchy on the details here, I hope this is enough of an answer to get you started. If you would like any explanation or additional details, let me know and I'll be happy to help.
If you're not tied to MySQL, I suggest using PostgreSQL, which provides built-in array datatypes.
You can define an array of varchar field in your groups table to store the group-specific fields. To store the values you can do the same in the membership table.
Compared to XML types based on string parsing, this array approach should be really fast.
If you don't like the array approach, you can check out the xml datatype and the optional hstore datatype, which is a key-value store.

What is the most efficient way to store a sort-order on a group of records in a database? [closed]

Assume PHP/MYSQL but I don't necessarily need actual code, I'm just interested in the theory behind it.
A good use-case would be Facebook's photo gallery page. You can drag and drop a photo on the page, which fires an Ajax event to save the new sort order. I'm implementing something very similar.
For example, I have a database table "photos" with about a million records:
photos
id : int,
userid : int,
albumid : int,
sortorder : int,
filename : varchar,
title : varchar
Let's say I have an album with 100 photos. I drag/drop a photo into a new location and the Ajax event fires off to save on the server.
Should I be passing the entire array of photo ids back to the server and updating every record? Assume input validation by "WHERE userid=loggedin_id", so malicious users can only mess with the sort order of their own photos
Should I be passing the photo id, its previous sortorder index and its new sortorder index, retrieve all records between these 2 indices, sort them, then update their orders?
What happens if there are thousands of photos in a single gallery and the sort order is changed?
What about just using an integer column that defines the order? By default you assign multiples of 1000, like 1000, 2000, 3000, ..., and if you move 3000 between 1000 and 2000 you change it to 1500. So in most cases you don't need to update the other numbers at all. I use this approach and it works well. You could also use a double, but then you have no control over precision and rounding errors, so I would avoid it.
So the algorithm would look like this: say you move B to the position after A. First perform a select to get the order value of the record that follows A. If it is at least 2 higher than the order of A, set the order of B to a value in between. But if it is only 1 higher (there is no space after A), select the records bordering B to see how much space there is on that side, divide it by 2, and add that value to the order of all the records between A and B. That's it!
(Note that you should use transaction/locking for any algorithm which contains more than a single query, so this applies to this case too. The easiest way is to use InnoDB transaction.)
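The gapped-integer idea can be sketched in plain Python as below. This is a simplified variant, not the exact procedure described above: when no free integer is left between the two neighbours, it renumbers the whole album in steps of 1000 instead of shifting only the records between A and B. The function name, the dict-based storage and `STEP` are assumptions for illustration; in a real application each assignment would be an UPDATE inside one transaction.

```python
STEP = 1000  # default spacing between consecutive sort keys

def move_after(orders, moved, anchor):
    """Move `moved` directly after `anchor`.

    `orders` maps photo id -> sortorder and is updated in place.
    Returns the set of ids whose sortorder was written.
    """
    # Everyone except the moved photo, in current display order.
    ranked = sorted((pid for pid in orders if pid != moved), key=orders.get)
    pos = ranked.index(anchor)
    low = orders[anchor]

    if pos + 1 == len(ranked):
        # Anchor is last: append with a full gap, a single write.
        orders[moved] = low + STEP
        return {moved}

    high = orders[ranked[pos + 1]]
    if high - low >= 2:
        # There is a free integer between the neighbours: a single write.
        orders[moved] = (low + high) // 2
        return {moved}

    # No room left: rebuild evenly spaced keys for the whole album.
    ranked.insert(pos + 1, moved)
    written = set()
    for index, pid in enumerate(ranked, start=1):
        new_order = index * STEP
        if orders[pid] != new_order:
            orders[pid] = new_order
            written.add(pid)
    return written

album = {"a": 1000, "b": 2000, "c": 3000}
touched = move_after(album, "c", "a")  # drop photo c between a and b
print(album, touched)
```

In the common case only one row is written, which is the point of leaving gaps; the renumbering branch is the rare, more expensive fallback.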
Store as a linked list, sortorder is a foreign key reference to the next photo_id in the set.
this would probably be a 'linked list' construct.
To me the second method of updating is the way to go (update only the range that changes). You mention "What happens if there are thousands of photos in a single gallery ...", and to me that is never going to happen. Let's take your Facebook example. Facebook doesn't show thousands of photos on one page; they split it up into about 10-20 per page.
The way I'd do this in a nonrelational database is to store a list of photo IDs on the 'album' entity/record, in the order desired. Reordering the photos results in reordering the list, and only a single database write.
Some SQL databases (e.g., PostgreSQL) have native list datatypes, but MySQL doesn't. You could serialize the list as a string or binary on MySQL.
3rd-normal-form trained database gurus will scream at you that this is a terrible approach, but RDBMSes are optimized for OLAP type queries, where query flexibility is more important than read performance. Webapps are best written with a 'write heavy, read light' strategy in mind, and this sort of denormalization is exactly in line with that.

A Beginner Question on database design

This is a follow-up question to my previous one. We are junior-year students doing website development for the university as volunteer work. We are using PHP and MySQL.
I am mainly responsible for the database development using MySQL, but I am a MySQL beginner. I am asking for some hints on writing my first table, to get my hands on it, so that I can then work well with the other tables.
The question is this: the first thing our website is going to do is present a survey to users to collect their preferences on when they want to use the bus service,
and this is where I am going to start my database development.
The User Requirement Document specifies that for the survey, there should be
Customer side:
The survey will be available to customers, with a set of predefined questions and answers, and should be easy to fill out
Business side:
Survey info will be stored, output and displayable for analysis.
It doesn't sound like too much work, and I don't need to care about any PHP, but I am confused: should I create a single table called "Survey", or two tables, "Survey_business" and "Survey_Customer", and how can the database store the info?
I would be grateful if you could give me some help so I can get going, because the first step is always the hardest and most important.
Thanks.
I would use multiple tables. One for the surveys themselves, and another for the questions. Maybe one more for the answer options, if you want to go with multiple-choice questions. Another table for the answers with a record per question per answerer. The complexity escalates as you consider multiple types of answers (choice, fill-in-the-blank single-line, free-form multiline, etc.) and display options (radio button, dropdown list, textbox, yada yada), but for a simple multiple-choice example with a single rendering type, this would work, I think.
Something like:
-- Survey info such as title, publish dates, etc.
create table Surveys
(
survey_id int primary key,
survey_title varchar(200)
);
-- one record per question, associated with the parent survey
create table Questions
(
question_id int primary key,
survey_id int, -- references Surveys.survey_id
question varchar(200)
);
-- one record per multiple-choice option in a question
create table Choices
(
choice_id int primary key,
question_id int, -- references Questions.question_id
choice varchar(200)
);
-- one record per question per answerer to keep track of who
-- answered each question
create table Answers
(
answer_id int primary key,
answerer_id int,
choice_id int -- references Choices.choice_id
);
Then use application code to:
Insert new surveys and questions.
Populate answers as people take the surveys.
Report on the results after the survey is in progress.
You, as the database developer, could work with the web app developer to design the queries that would both populate and retrieve the appropriate data for each task.
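To show how the storage tables serve both the customer side (inserting answers) and the business side (reporting), here is a minimal sketch on SQLite from Python. The sample survey, the answerer ids and the report query are assumptions; the four tables mirror the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE surveys (survey_id INTEGER PRIMARY KEY, survey_title TEXT);
CREATE TABLE questions (
    question_id INTEGER PRIMARY KEY,
    survey_id INTEGER REFERENCES surveys(survey_id),
    question TEXT
);
CREATE TABLE choices (
    choice_id INTEGER PRIMARY KEY,
    question_id INTEGER REFERENCES questions(question_id),
    choice TEXT
);
CREATE TABLE answers (
    answer_id INTEGER PRIMARY KEY,
    answerer_id INTEGER,
    choice_id INTEGER REFERENCES choices(choice_id)
);
INSERT INTO surveys VALUES (1, 'Bus service times');
INSERT INTO questions VALUES (1, 1, 'When do you usually take the bus?');
INSERT INTO choices VALUES (1, 1, 'Morning'), (2, 1, 'Evening');
""")

# Customer side: each submitted answer becomes one row.
conn.executemany(
    "INSERT INTO answers (answerer_id, choice_id) VALUES (?, ?)",
    [(101, 1), (102, 1), (103, 2)],
)

# Business side: count the votes per choice for one survey.
report = conn.execute("""
    SELECT q.question, c.choice, COUNT(a.answer_id) AS votes
    FROM questions q
    JOIN choices c ON c.question_id = q.question_id
    LEFT JOIN answers a ON a.choice_id = c.choice_id
    WHERE q.survey_id = ?
    GROUP BY q.question, c.choice_id, c.choice
    ORDER BY c.choice_id
""", (1,)).fetchall()
print(report)
```

No separate "business" table is needed: the report is just a query over the same rows the customer side wrote.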
Only 1 table is needed; you only change the way you use the table on each occasion:
the customer side inserts data into the table
the business side reads the data and results from the same table
Survey.Customer sounds like a storage function, while Survey.Business sounds like a retrieval function.
The only tables you need are for storage. The retrieval operations will take place using queries and reports of the existing storage tables, so you don't need additional tables for those.
Use a single table only. If you were to use two tables, then anytime you make a change you would in effect have to do everything twice. That's a big pain for maintenance for you and anyone else who comes in to do it in the future.
Most of the advice and answers so far are applicable, but they make certain (unstated!) assumptions about your domain.
Try to make a logical model of the entities and attributes that are required to capture the requirements, examine the relationships, consider how the data will be used on both sides of the process, and then design the tables. Talk to the users, talk to the people that will be running the reports, talk to whoever is designing the user interface (screens and reports) to get the complete picture.
Pay close attention to the reporting requirements, as they often imply additional attributes and entities not present in the data-entry schema.
I think 2 tables are needed:
A survey table for storing the questions and answer choices. Each survey is stored in one row with a unique survey id.
The other table is for storing answers. I think it's better to store each customer's answers in one row, with a survey id and, if necessary, a customer id.
Then you can compute the results and expose them in a surveyResults view.
Is the data you're presenting as the questions and answers going to be dynamic? Is this a long-term project that's going to have questions swapped in and out? If so, you'll probably want to have the questions and answers in your database as well.
The way I'd do it would be to define your entities and figure out how to design your tables so relationships are straightforward. Sounds to me like you have three entities:
Question
Answer
Completed Survey
Just a sample elaboration of what Steven and Chris have mentioned above.
There are going to be multiple tables if there are multiple surveys, each survey has a different set of questions, and the same user can take multiple surveys.
1. Customer table with CustID as the primary key
2. Questions table with QuestionID as the primary key. If a question cannot belong to more than one survey (an N:1 relationship), it can also have the SurveyID (of the Survey table mentioned in point 3) as one of its columns.
3. Survey table. If the Survey to Question relationship is N:M, then (SurveyID, QuestionID) would become a composite key for the Survey table; otherwise it would just have the SurveyID with the high-level details of the survey, like the description.
4. UserSurvey table, which would contain (UserID, SurveyID, QuestionID, AnswerGiven)
[Note: if the same user can take the same survey again and again, either the old survey has to be updated or the repeat attempts have to be stored as additional rows with some serial number.]