Database Design: Composite key vs one column primary key - mysql

A web application I am working on has encountered an unexpected 'bug' - The database of the app has two tables (among many others) called 'States' and 'Cities'.
'States' table fields:
-------------------------------------------
idStates | State | Lat | Long
-------------------------------------------
'idStates' is an auto-incrementing primary key.
'Cities' table fields:
----------------------------------------------------------
idAreaCode | idStates | City | Lat | Long
----------------------------------------------------------
'idAreaCode' is a primary key consisting of country code + area code (e.g. 91422 where 91 is the country code for india and 422 is the area code of a city in India). 'idStates' is a foreign key derived from 'States' table to associate each city in the 'Cities' table with its corresponding State.
We figured that the country code + area code combination would be unique for each city, and thus could safely be used as a primary key. Everything was working. But a location in India exposed an unexpected 'flaw' in the db design. India, like the US, is a federal democracy and is geographically divided into many states and union territories. Data for both states and union territories is stored in the 'States' table. There is, however, one location - Chandigarh - which belongs to TWO states (Haryana and Punjab) and is also a union territory by itself.
Obviously, the current db design doesn't allow us to store more than one record of the city 'Chandigarh'.
One of the solutions suggested is to create a primary key combining the columns 'idAreaCode' and 'idStates'.
I'd like to know if this is the best solution possible?
(FYI: we are using MySQL with the InnoDB engine).
More information:
The database stores meteorological information for each city. Thus, the state and city are the starting point of each query.
Fresh data for each city is inserted every day using a CSV file. The CSV file includes an idStates (for state) and an idAreaCode (for city) column, which together identify each record.
Database normalization is important to us.
Note: The reason for not using an auto incrementing primary key for the city table is that the database is updated everyday / hourly using a CSV file (which is generated by another app). And each record in the CSV file is identified by the idStates and idAreaCode column. Hence it is preferred that the primary key used in the city table is the same for every city, even if the table is deleted and refreshed again. Zip codes (or pin codes) and area codes (or STD codes) meet the criteria of being unique, static (don't change often) and a ready list of these are easily available. (We decided on area codes for now because India is in the process of updating its pin codes to a new format).
The solution we decided on was to handle this at the application level instead of making changes to the database design. In the database we will only be storing one record of 'Chandigarh'. In the application we've created a flag for any search for 'Chandigarh, Punjab' or 'Chandigarh, Haryana' to redirect search to this record. Yeah, it's not ideal, but an acceptable compromise since this is the ONLY exception we've come across so far.

It sounds like you are gathering data for a telephone directory. Are you? Why are states important to you? The answer to this question will probably determine which database design will work best for you.
You may think that it's obvious what a city is. It's not. It depends on what you are going to do with the data. In the US, there is this unit called MSA (Metropolitan Statistical Area). The Kansas City MSA spans both Kansas City, Kansas and Kansas City, Missouri. Whether the MSA unit makes sense or not depends on the intended use of the data.
If you used area codes in US to determine cities, you'd end up with a very different grouping than MSAs. Again, it depends on what you are going to do with the data.
In general, whenever hierarchical patterns of political subdivisions break down, the most general solution is to treat the relationship as many-to-many. You solve this problem the same way you solve other many-to-many problems: by creating a new table with two foreign keys. In this case the foreign keys are idAreaCode and idStates.
Now you can have one area code in many states and one state spanning many area codes. It seems a shame to accept this extra overhead to cover just one exception. Do you know whether the exception you have uncovered is just the tip of the iceberg, and there are many such exceptions?
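A minimal sketch of that many-to-many layout, assuming a junction table named CityStates (a hypothetical name) and using Python's built-in sqlite3 module in place of MySQL/InnoDB so it can be run as-is. The ids and the area code value are illustrative only.

```python
import sqlite3

# SQLite stands in for MySQL here; CityStates is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE States (idStates INTEGER PRIMARY KEY, State TEXT NOT NULL);
CREATE TABLE Cities (idAreaCode INTEGER PRIMARY KEY, City TEXT NOT NULL);
CREATE TABLE CityStates (
    idAreaCode INTEGER NOT NULL REFERENCES Cities(idAreaCode),
    idStates   INTEGER NOT NULL REFERENCES States(idStates),
    PRIMARY KEY (idAreaCode, idStates)
);
""")
conn.executemany("INSERT INTO States VALUES (?, ?)",
                 [(1, "Punjab"), (2, "Haryana"), (3, "Chandigarh (UT)")])
# One city row only; the area code value is illustrative.
conn.execute("INSERT INTO Cities VALUES (?, ?)", (91172, "Chandigarh"))
conn.executemany("INSERT INTO CityStates VALUES (?, ?)",
                 [(91172, 1), (91172, 2), (91172, 3)])

# All states the city belongs to, resolved through the junction table.
states_for_city = [row[0] for row in conn.execute(
    """SELECT s.State FROM CityStates cs
       JOIN States s ON s.idStates = cs.idStates
       WHERE cs.idAreaCode = ? ORDER BY s.idStates""", (91172,))]
```

The city is stored once, and each state membership is one row in the junction table, so further exceptions of the same kind need no schema change.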

Having a composite key could be problematic when you want to reference that table, since the referring table would have to have all columns the primary key has.
If that's the case, you might want to have a sequence primary key, and have the idAreaCode and idStates defined in a UNIQUE NOT NULL group.
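As a rough illustration of this suggestion, the sketch below gives Cities a surrogate key (the column name idCities is made up) and declares the (idAreaCode, idStates) pair UNIQUE and NOT NULL. It uses Python's sqlite3 module rather than MySQL, so the auto-increment syntax differs from what InnoDB would need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cities (
    idCities   INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical surrogate key
    idAreaCode INTEGER NOT NULL,
    idStates   INTEGER NOT NULL,
    City       TEXT NOT NULL,
    UNIQUE (idAreaCode, idStates)                  -- the natural identifier
);
""")
rows = [(91172, 1, "Chandigarh"), (91172, 2, "Chandigarh")]
conn.executemany(
    "INSERT INTO Cities (idAreaCode, idStates, City) VALUES (?, ?, ?)", rows)

# Inserting the same (idAreaCode, idStates) pair again violates UNIQUE.
try:
    conn.execute(
        "INSERT INTO Cities (idAreaCode, idStates, City) VALUES (?, ?, ?)",
        (91172, 1, "Chandigarh"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

city_count = conn.execute("SELECT COUNT(*) FROM Cities").fetchone()[0]
```

Referencing tables then need only the single idCities column, while the UNIQUE constraint still prevents duplicate city/state pairs.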

I think it is best to add another table, countries. Your problem is an example of why database normalization is important. You can't just mix and match different keys in one column.
So, I suggest you create these tables:
countries:
+------------+--------------+
| country_id | country_name |
+------------+--------------+
states:
+------------+----------+------------+
| country_id | state_id | state_name |
+------------+----------+------------+
cities
+------------+----------+---------+-----------+
| country_id | state_id | city_id | city_name |
+------------+----------+---------+-----------+
data
+------------+----------+---------+---------+----------+
| country_id | state_id | city_id | data_id | your_CSV |
+------------+----------+---------+---------+----------+
The bold fields are primary keys. Enter a standard country_id like 1 for the US, 91 for India, and so on. city_id should also use its standard id.
You can then find which rows belong to each other quickly and with minimal overhead. All data can then be entered directly into the data table, which serves as the single entry point and stores all the data in one spot. I don't know about MySQL, but if your database supports partitioning, you can partition the data table by country_id or by country_id + state_id across several servers, which should also speed up your database considerably. The first, second, and third tables add very little server load and only serve as reference. You will mainly be working on the fourth table, data. You can add as much data as you wish without creating duplicates.
If you only have one data row per city, you can omit the data table and move CSV_data to the cities table like this:
cities
+------------+----------+---------+-----------+----------+
| country_id | state_id | city_id | city_name | CSV_data |
+------------+----------+---------+-----------+----------+

If you go with adding an additional column to the key so that you can add an additional record for a given city, then you're not properly normalizing your data. Given that you've now discovered that a city can be a member of multiple states, I would suggest removing any reference to a state from the Cities table, then adding a StateCity table that allows you to relate states to cities (creating a m:m relationship).

Introduce a surrogate key. What are you going to do when area codes change numbers or get split? Using business keys as a primary key is almost always a mistake.
Your summary above is another example of why.

"We figured that the country code + area code combination would be unique for each city, and thus could safely be used as a primary key"
After reading this, I stopped reading anything further in this topic.
How could someone come to that conclusion?
Area codes, by definition (the first one I found on internet):
- "An Area code is the prefix numbers that are used to identify a geographical region based on the North American number Plan. This 3 digit number can be assigned to any number in North America, including Canada, The United States, Mexico, Latin America and the Caribbean" [1]
Putting aside that they are changeable and that this definition covers only North America, area codes are not 3 digits long in some other countries (3 digits is simply not enough for countries with hundreds of thousands of locations; BTW, my mother's area code has 5 digits), and they are not strictly linked to fixed geographical locations.
Area codes can be assigned to moving locations such as arctic camps drifting with the ice, nomadic tribes, migrating military units or even big ocean-going ships.
Then, what about merging a few cities into one (or vice versa)?
[1] http://www.successfuloffice.com/articles/answering-service-glossary-area-code.htm

I recommend adding a new primary key field to the Cities table that will be simply auto-incremental. The KISS methodology (keep it simple).
Any other solution is cumbersome and confusing in my opinion.

The database is not Normalised. It may be partly Normalised. You will find many more bugs and limitations in extensibility, as a result.
A hierarchy of Country then State then City is fine. You do not need an additional many-to-many table as some suggest. The city in question (like many in America) simply exists in more than one State; here, in three.
By placing CountryCode and AreaCode, concatenated, in a single column, you have broken basic database rules, not to mention added code on every access. Additionally, CountryCode is not Normalised.
The problem is that CountryCode+AreaCode is a poor choice for a key for a City. In real terms, it has very little to do with a city, it applies to huge swaths of land. If the meaning of City was changed to town (as in, your company starts collecting data for large towns), the db would break completely.
Magician has the only answer that is close to being correct, and that would save you from your current limitations due to lack of Normalisation. It is not accurate to say that Magician's answer is Normalised; it is the correct choice of Identifiers, which form a hierarchy in this case. But I would remove the "id" columns because they are unnecessary: 100% redundant columns, 100% redundant indices. The char() columns are fine as they are, and fine for the PK (compound keys). Remember you need an Index on the char() column anyway, to ensure it is unique.
If you had this, the Relational structure, with Relational Identifiers, your problem would not exist.
And your poor users would not have to figure silly things out or keep track of meaningless identifiers. They just state, naturally: State.Name, City.Name, ReadingType, Data ...
When you get to the lower end of the hierarchy (City), the compound PK has become onerous (3 x CHAR(20) ), and I wouldn't want to carry it into the Data table (esp if there are daily CSV imports and many readings or rows per city). Therefore for City only, I would add a surrogate key, as the PK.
But for the posted DDL, even as it is, without Normalising the db and using Relational Identifiers, yes, the PK of City is incorrect. It should be (idStates, idAreaCode), not the other way around. That will fix your problem.
Very bad naming by the way.
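To make the composite key (idStates, idAreaCode) concrete, here is a small sketch of how a daily CSV load keyed on those two columns might behave. The CSV contents are invented (the question does not show the real file), and SQLite's INSERT OR REPLACE is used as a stand-in for whatever upsert the MySQL import would use.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Cities (
    idStates   INTEGER NOT NULL,
    idAreaCode INTEGER NOT NULL,
    City       TEXT NOT NULL,
    PRIMARY KEY (idStates, idAreaCode)
)""")

# Made-up CSV feed; the real file layout is not shown in the question.
feed = io.StringIO(
    "idStates,idAreaCode,City\n"
    "1,91172,Chandigarh\n"
    "2,91172,Chandigarh\n"
    "1,91172,Chandigarh\n"  # same key as the first row: replaces it
)
for rec in csv.DictReader(feed):
    conn.execute(
        "INSERT OR REPLACE INTO Cities VALUES (?, ?, ?)",
        (int(rec["idStates"]), int(rec["idAreaCode"]), rec["City"]))

keys = conn.execute(
    "SELECT idStates, idAreaCode FROM Cities ORDER BY idStates").fetchall()
```

The same area code can now appear once per state, and reloading the same CSV row does not create a duplicate.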

Related

How to structure a Bill of Materials that has multiple options

I am stuck trying to develop a Bill of Materials in Access. I have a table called IM_Item_Registry where I have the Item_Code and a boolean for whether it's a component. Where I'm stuck is that past sins of the company created several part numbers for the same ingredient from different vendors. A product may use ingredient 1 at the beginning of a run and ingredient 2 at the end of a run depending on inventory, and it may switch from job to job (lack of discipline and random purchasing based on price). It's creating a headache for me because the ingredients typically have different inclusion rates. How would I go about adding the flexibility to use both? Or would it just be easier to make multiple versions and then select a version upon scheduling?
I know this is a loaded question and I can include more detail if needed, but I appreciate your help. I've been researching how to do this for a couple of weeks now.
EDIT (3/28/2019)
This is for an injection molding company.
IM_Item_Registry (Fields: Item_Code, Category (raw, manufactured, customer supplied, assembly component), Description, Component (boolean), Active (boolean), Unit of Measure).
For this bill of materials, 100011 produces a component; let's call it a handle. Bill 100011 uses raw resin 700049 at 98% inclusion and raw color 600020 at 2% inclusion. However, we may run out of raw color 600020 and have to run 600051 instead, which would change 700049 to 98.5% inclusion because 600051 requires only 1.5% inclusion to achieve the same color.
I would like to create a table that calls out a general term; let's say 600020 and 600051 are both a yellow color additive. Then I would create a "ghost" number that calls for either 600020 or 600051 and gives both formulation recipes. When production starts, they would scan in which color they actually used, creating the production BOM themselves and recording which color was used and how much. Is there a way to structure this in an Access database?
I'm assuming I would need the item_registry table, a BoM table (fields: BOM#, ParentID, Ghost_ID) and then a components table (fields: Ghost_ID, item_code, Inclusion Rate).
Database normalization is the guiding principle for designing efficient, useful tables and relationships in a relational database. Access forms, subforms, reports, etc. require properly normalized tables to work as intended. There are various levels of normalization, but the common idea is to avoid duplication of data between rows and columns of data. Having duplicate data requires a lot of overhead in storage and in ensuring that actions on the database do not create inconsistent states (contradictory data values). Well-normalized tables allow useful constraints to be defined between data columns and/or rows to ensure that data is valid.
The [BoM] table as proposed in the question is not normalized. But before we get to that, the ParentID was not defined and it's not clear what it represents. Instead, to help show why it's not normalized, let me add a [Product] column to the [BoM] table. Then if such a handle has two alternative lists of components (ghosts?), the table would look like
BOMID, Product, GhostID
----- ------- -------
1 Handle 1
1 Handle 2
See the duplication? And now if the product is renamed, for instance to "Bronze Handle", then both rows need to be updated for a single conceptual element. It also introduces the possibility of having contradictory data like
BOMID, Product, GhostID
----- ------- -------
1 Handle 1
1 Bronze Handle 2
Enough said about that, since I've already gone on too much about normalization concepts here. Following is a basic normalized schema which would serve you better, but notice that it's not too different from what you proposed in the question. The only real difference is that the BoM table is normalized by splitting its columns (and purpose) into another table.
I do not list all columns here, only primary and foreign keys and a few other meaningful columns. PK = Primary Key (unique, non-null key), FK = Foreign Key. Proper indices should be defined on the PK and FK columns AND relationships defined with appropriate constraints.
Table: [IM_Item_Registry]
Item_Code (PK)
Table: [BOM]
BOMID (PK)
ProductID (FK)
Table: [BOM_Option]
OptionID (PK)
BOMID (FK)
Primary (boolean) - flags the primary/usual list of components
Description
Table: [Option_Items]
OptionID (FK; part of composite PK)
Item_Code (FK; part of composite PK)
Inclusion_Rate
The [BOM].[ProductID] column alludes to another table with details of the product which should be defined separately from the Bill of Material. If this database really is super-simplistic, then it could just be a string field [Product] containing the name, but I assume there are more useful details to store. Perhaps this is what the ParentID also alluded to? (I suggest choosing names that are not so abstract like "parent" and "ghost", hence my choice of the word "option".)
Really, since each BOM should be limited to a single primary option, it would fulfill proper normalization to drop the Primary flag and create another table like
Table: [BOM_Primary]
[BOMID] (FK and PK) - Primary key so only one primary option can be defined at once
[OptionID] (FK)
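The schema above can be tried out with the numbers from the question. This is only a sketch: it runs on Python's sqlite3 module instead of Access, stores the product name directly in [BOM] instead of a ProductID foreign key, and the descriptions and OptionID values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE IM_Item_Registry (Item_Code INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE BOM (BOMID INTEGER PRIMARY KEY, Product TEXT NOT NULL);
CREATE TABLE BOM_Option (
    OptionID    INTEGER PRIMARY KEY,
    BOMID       INTEGER NOT NULL REFERENCES BOM(BOMID),
    Description TEXT
);
CREATE TABLE Option_Items (
    OptionID       INTEGER NOT NULL REFERENCES BOM_Option(OptionID),
    Item_Code      INTEGER NOT NULL REFERENCES IM_Item_Registry(Item_Code),
    Inclusion_Rate REAL NOT NULL,
    PRIMARY KEY (OptionID, Item_Code)
);
-- BOMID is the primary key, so a BOM can have only one primary option.
CREATE TABLE BOM_Primary (
    BOMID    INTEGER PRIMARY KEY REFERENCES BOM(BOMID),
    OptionID INTEGER NOT NULL REFERENCES BOM_Option(OptionID)
);
""")
conn.executemany("INSERT INTO IM_Item_Registry VALUES (?, ?)", [
    (700049, "raw resin"), (600020, "yellow color A"), (600051, "yellow color B")])
conn.execute("INSERT INTO BOM VALUES (100011, 'Handle')")
conn.executemany("INSERT INTO BOM_Option VALUES (?, ?, ?)", [
    (1, 100011, "color 600020"), (2, 100011, "color 600051")])
conn.executemany("INSERT INTO Option_Items VALUES (?, ?, ?)", [
    (1, 700049, 98.0), (1, 600020, 2.0),
    (2, 700049, 98.5), (2, 600051, 1.5)])
conn.execute("INSERT INTO BOM_Primary VALUES (100011, 1)")

def recipe(option_id):
    """Return {Item_Code: Inclusion_Rate} for one option."""
    return dict(conn.execute(
        "SELECT Item_Code, Inclusion_Rate FROM Option_Items WHERE OptionID = ?",
        (option_id,)))

primary_option = conn.execute(
    "SELECT OptionID FROM BOM_Primary WHERE BOMID = 100011").fetchone()[0]
```

When production scans the color actually used, the matching OptionID identifies the whole recipe, so the resin rate changes together with the color.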

Normalize two tables with same primary key to 3NF

I currently have two tables with the same primary key. Can two tables have the same primary key?
Also, are all the tables in third normal form?
Ticket:
-------------------
Ticket_id* PK
Flight_name* FK
Names*
Price
Tax
Number_bags
Travel class:
-------------------
Ticket id * PK
Customer_5star
Customer_normal
Customer_2star
Airmiles
Lounge_discount
ticket_economy
ticket_business
ticket_first
food allowance
drink allowance
The rest of the tables in the database are below:
Passengers:
Names* PK
Credit_card_number
Credit_card_issue
Ticket_id *
Address
Flight:
Flight_name* PK
Flight_date
Source_airport_id* FK
Dest_airport_id* FK
Source
Destination
Plane_id*
Airport:
Source_airport_id* PK
Dest_airport_id* PK
Source_airport_country
Dest_airport_country
Pilot:
Pilot_name* PK
Plane id* FK
Pilot_grade
Month
Hours flown
Rate
Plane:
Plane_id* PK
Pilot_name* FK
This is not meant as an answer but it became too long for a comment...
Not to sound harsh, but your model has some serious flaws and you should probably take it back to the drawing board.
Consider what would happen if a Passenger buys a second Ticket for instance. The Passenger table should not hold any reference to tickets. Maybe a passenger can have more than one credit card though? Shouldn't Credit Cards be in their own table? The same applies to Addresses.
Why does the Airport table hold information that really is about destinations (or paths/trips)? You already record trip information in the Flights table. It seems to me that the Airport table should hold information pertaining to a particular airport (like name, location?, IATA code et cetera).
Can a Pilot just be associated with one single Plane? Doesn't sound very likely. The pilot table should not hold information about planes.
And the Planes table should not hold information on pilots as a plane surely can be connected to more than one pilot.
And so on... there are most likely other issues too, but these pointers should give you something to think about.
The only tables that sort of look ok to me are Ticket and Flight.
Re same primary key:
Yes there can be multiple tables with the same primary key. Both in principle and in good practice. We declare a primary or other unique column set to say that those columns (and supersets of them) are unique in a table. When that is the case, declare such column sets. This happens all the time.
Eg: A typical reasonable case is "subtyping"/"subtables", where entities of a kind identified by a candidate key of one table are always or sometimes also of the kind identified by the same values in another table. (If always, then the one table's candidate key values are also in the other table's. And so we would declare a foreign key from the one to the other. We would say the one table's kind of entity is a subtype of the other's.) On the other hand, sometimes one table is used with attributes of both kinds, and attributes inapplicable to one kind are not used. (Ie via NULL or a tag indicating kind.)
Whether you should have cases of the same primary key depends on other criteria for good design as applied to your particular situation. You need to learn design including normalization.
Eg: All keys simple and 3NF implies 5NF, so if your two tables have the same set of values as only & simple primary key in every state and they are both in 3NF then their join contains exactly the same information as they do separately. Still, maybe you would keep them separate for clarity of design, for likelihood of change or for performance based on usage. You didn't give that information.
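A small sketch of the shared-primary-key case, using cut-down versions of the question's Ticket and Travel class tables (only Price and Airmiles are kept, and the values are made up). It runs on Python's sqlite3 module; the foreign key from Travel_class to Ticket is the subtype declaration described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Ticket (
    Ticket_id INTEGER PRIMARY KEY,
    Price     REAL NOT NULL
);
-- Same key as Ticket: at most one Travel_class row per ticket.
CREATE TABLE Travel_class (
    Ticket_id INTEGER PRIMARY KEY REFERENCES Ticket(Ticket_id),
    Airmiles  INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO Ticket VALUES (1, 250.0)")
conn.execute("INSERT INTO Travel_class VALUES (1, 1200)")

# A Travel_class row for a ticket that does not exist is refused.
try:
    conn.execute("INSERT INTO Travel_class VALUES (2, 500)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Joining on the shared key gives one row per ticket with both sets of columns.
joined = conn.execute("""
    SELECT t.Ticket_id, t.Price, c.Airmiles
    FROM Ticket t JOIN Travel_class c ON c.Ticket_id = t.Ticket_id
""").fetchall()
```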
Re normal forms:
Normal forms apply to tables. The highest normal form of a table is a property independent of any other table. (Although you might choose that form based on what forms & tables are alternatives.)
In order to normalize or determine a table's highest normal form one needs to know (in general) all the functional dependencies in it. (For normal forms above BCNF, also join dependencies.) You didn't give them. They are determined by what the meaning of the table is (ie how to determine what rows go in it in any given situation) and the possible situations that can arise. You didn't give them. Your expectation that we could tell you about the normal forms your tables are in without giving such information suggests that you do not understand normalization and need to educate yourself about it.
Proper design also needs this information and in general all valid states that can arise from situations that arise. Ie constraints among given tables. You didn't give them.
Having two tables with the same key goes against the idea of removing redundancy in normalization.
Excluding that, are these tables in 1NF and 2NF?
Judging by the Names field, I'd suggest that table1 is not. If multiple names can belong to one ticket, then you need a new table, most likely with a composite key of ticket_id,name.

How can I reference an intersection table of a many-to-many relationship?

I am creating a pharmacy database that handles prescriptions. When designing the database, I took into consideration that doctors can work at many offices, and an office can be home to many doctors, so I created the following many to many relationship:
doctor:
| id | name | // more
office:
| id | name | address | // more
doctors_offices:
| doctor_id | office_id |
I followed the design I've seen in my database textbooks as well as many online resources, but I'm now running into a little confusion when trying to create a prescription table. In this table I want to know not only which doctor wrote the prescription, but at which location.
I find myself having a few options:
Add an auto_increment key to the doctors_offices table to have a unique identifier for each dr/office pairing
Add a composite foreign key to the prescription table that references doctors_offices. (Is this possible?)
Add two foreign keys to prescription table. One that references doctors and one that references offices.
Which of these options is most normalized? I know that the third one is likely the least normalized, as it opens up the possibility that I select an office that a doctor does not belong to, but I felt important to mention as it might be a common instinct among some beginner database designers.
It is more flexible to save the doctor_id as well as the location_id along with the prescription data. Otherwise, you will run into difficulties if a doctor moves to another office after already having written some prescriptions, unless you make a new entry for every time period a doctor was present in a certain office. Furthermore, you won't be able to properly represent the situation where, exceptionally, another doctor is present in the office and writes a prescription even though he or she has no assignment to that office.
People tend to become ingenious if the system prevents them from inserting data for a technical reason. You should always consider such exceptions, because they will occur.
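For comparison, option 2 from the question (a composite foreign key from prescription to doctors_offices) is possible, and the sketch below shows what it enforces. It uses Python's sqlite3 module; MySQL's InnoDB also accepts multi-column foreign keys. The prescription columns and the sample names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE doctor (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE office (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE doctors_offices (
    doctor_id INTEGER NOT NULL REFERENCES doctor(id),
    office_id INTEGER NOT NULL REFERENCES office(id),
    PRIMARY KEY (doctor_id, office_id)
);
CREATE TABLE prescription (
    id        INTEGER PRIMARY KEY,
    doctor_id INTEGER NOT NULL,
    office_id INTEGER NOT NULL,
    -- The pair must exist as a row in doctors_offices.
    FOREIGN KEY (doctor_id, office_id)
        REFERENCES doctors_offices (doctor_id, office_id)
);
""")
conn.execute("INSERT INTO doctor VALUES (1, 'Dr. A')")
conn.executemany("INSERT INTO office VALUES (?, ?)", [(10, "North"), (20, "South")])
conn.execute("INSERT INTO doctors_offices VALUES (1, 10)")

conn.execute("INSERT INTO prescription VALUES (1, 1, 10)")  # valid pairing

# Doctor 1 is not assigned to office 20, so this insert is refused.
try:
    conn.execute("INSERT INTO prescription VALUES (2, 1, 20)")
    unassigned_rejected = False
except sqlite3.IntegrityError:
    unassigned_rejected = True

prescription_count = conn.execute(
    "SELECT COUNT(*) FROM prescription").fetchone()[0]
```

This is exactly the strictness the answer warns about: the rejected insert is the "visiting doctor" case, and removing a doctors_offices row later is blocked while prescriptions still reference it.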

Database design issue regarding identifying relationships and many to many relationships

I have a weird database design issue that I'm not sure if I'm doing this right or not. Since my current design is really complicated, I've simplified it in the following diagram with a comparison using houses and occupants (not my actual entities).
So, here is what part of the database design looks like:
Standard Conditions:
Multiple houses
Multiple floors per house
Multiple bedrooms per floor
Not-so-standard Conditions:
Each occupant can live in multiple houses
Each occupant can have multiple bedrooms per house
Each occupant can only have one bedroom per floor, per house (this is the tricky part) For example, they can have one bedroom on floor 1, one bedroom on floor 2 and one bedroom on floor 3, but never two bedrooms on the same floor
Thus, what I'm trying to accomplish is this. In the app design, I know the house, the floor and the occupant. From those three criteria, and without the user specifying it, I need to find out which bedroom the occupant has. There are two solutions. The first is to make the primary key of the occupants_has_bedrooms table the combination of occupants_id, bedrooms_floors_id and bedrooms_floors_houses_id. However, when I take bedrooms_id out of the primary key, the table no longer has an identifying relationship to the parent (bedrooms). It is an identifying relationship, though, because it couldn't exist without the parent. Therefore, something tells me I need to keep all four ids as the primary key. My second option is a unique index across those three values, but this is when I began to think I may be approaching this wrong.
How do I accomplish this?
Here's a general database design strategy that is not specific to MySQL but should still be helpful.
It's good that you know how you are going to query your data, but don't let that overly affect your model (at least at first).
The first thing to be clear on is: what is the PK for each table? It looks like you are using composite keys for floors and bedrooms. If you used an informationless key (an ID column per table) for all tables except your intersection table occupants_has_bedrooms, it would make your joins simpler. I'm going to assume you can, so here's how to go from there:
The first thing I would change is to get rid of floors_house_id column in bedrooms - this is now redundant and can be gotten from a join.
Next, make the following changes to occupants_has_bedrooms:
The PK should be only two columns, occupants_id and bedrooms_id. (Why? Because a primary key should contain only enough information to uniquely identify a row.)
Remove bedrooms_floors_houses_id, as it is determined by bedrooms_floors_id and is not needed.
Add a unique constraint on (occupants_id, bedrooms_floors_id) to enforce your "not-so-standard" conditions.
Finally, do an inner join with all tables except Occupants, and add your three conditions in the WHERE clause. This should get you the result you want. If you really want the composite keys, you can still do it, but it gets messy. Sorry I'm not near an editor or I'd diagram it for you.
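The steps above can be sketched roughly as follows, using Python's sqlite3 module in place of MySQL. The extra UNIQUE (id, floors_id) on bedrooms and the composite foreign key are additions not mentioned in the answer; they are there so that the floor id stored in the intersection table cannot disagree with the bedroom's real floor. All ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE houses    (id INTEGER PRIMARY KEY);
CREATE TABLE floors    (id INTEGER PRIMARY KEY,
                        houses_id INTEGER NOT NULL REFERENCES houses(id));
CREATE TABLE bedrooms  (id INTEGER PRIMARY KEY,
                        floors_id INTEGER NOT NULL REFERENCES floors(id),
                        UNIQUE (id, floors_id));  -- target for the composite FK
CREATE TABLE occupants (id INTEGER PRIMARY KEY);
CREATE TABLE occupants_has_bedrooms (
    occupants_id       INTEGER NOT NULL REFERENCES occupants(id),
    bedrooms_id        INTEGER NOT NULL,
    bedrooms_floors_id INTEGER NOT NULL,
    PRIMARY KEY (occupants_id, bedrooms_id),
    UNIQUE (occupants_id, bedrooms_floors_id),    -- one bedroom per floor
    FOREIGN KEY (bedrooms_id, bedrooms_floors_id)
        REFERENCES bedrooms (id, floors_id)
);
INSERT INTO houses VALUES (1);
INSERT INTO floors VALUES (1, 1), (2, 1);
INSERT INTO bedrooms VALUES (101, 1), (102, 1), (201, 2);
INSERT INTO occupants VALUES (7);
INSERT INTO occupants_has_bedrooms VALUES (7, 101, 1), (7, 201, 2);
""")

try:
    # A second bedroom on floor 1 for the same occupant violates UNIQUE.
    conn.execute("INSERT INTO occupants_has_bedrooms VALUES (7, 102, 1)")
    second_room_rejected = False
except sqlite3.IntegrityError:
    second_room_rejected = True

def bedroom_for(house_id, floor_id, occupant_id):
    """Return the occupant's bedroom id on the given floor, or None."""
    row = conn.execute("""
        SELECT b.id
        FROM occupants_has_bedrooms ob
        JOIN bedrooms b ON b.id = ob.bedrooms_id
        JOIN floors f   ON f.id = b.floors_id
        JOIN houses h   ON h.id = f.houses_id
        WHERE h.id = ? AND f.id = ? AND ob.occupants_id = ?
    """, (house_id, floor_id, occupant_id)).fetchone()
    return row[0] if row else None
```

Because of the unique constraint, the lookup can return at most one bedroom for a given house, floor and occupant.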
I would design the database the reverse of what you did.
House
id
name
Floors -- Many to many
Floor
id
name
Bedrooms -- Many to many
optional: you can have a back pointer to house
Bedroom:
id
name
Occupants -- many to many
optional : back pointer to floor
Occupant:
id
name
optional : back pointer to Bedroom
Now having this many to many table you can query your conditions rather easily.

Designing a table in mysql

MySQL, PHP newbie here. Please be nice.
I have a list of colonies - colony1, colony2, ..., colony100.
Say colony5 is nearby to colony4, colony9 and colony10.
Colony4 is nearby to colony5, colony9, colony10 and colony11. Different colonies have different numbers of nearby colonies.
How do I store and fetch this data in MySQL?
Currently I am thinking of a table that would look like this ->
id | colony name | nearby_colony_id_1 | nearby_colony_id_2 | nearby_colony_id_3 | nearby_colony_id_4
Is there a better way of doing this?
I hope my question was clear.
Designing tables is fun!
If you want to store data where things are "nearby," you can do it with a graph (data structure). If you don't care about the specifics of nearby and only that a colony is nearby, you can just do this with another table. Do not do what you are trying to do (id_2, id_3, etc.). This is not a normalized DB and can lead to anomalies. A lot of people make this mistake the first time.
-- MySQL requires an auto_increment column to be defined as a key
CREATE TABLE
Colonies (`id` int unsigned NOT NULL auto_increment PRIMARY KEY, `name` varchar(255));
CREATE TABLE
Nearby_colonies (`a_id` int unsigned NOT NULL, `b_id` int unsigned NOT NULL);
So you have your colonies, in the colonies table. Then, for every nearby colony to that colony, you have an entry in Nearby_colonies that has a pair of IDs (order should not matter, but the names have to be different). This links the two colonies as Nearby colonies. Now one colony can have as many Nearby_colony entries as it likes instead of being limited to id_2, id_3, id_4, etc.
This also prevents anomalies from occurring because the relationship itself is stored for each colony, not just for one colony about another.
If you want to get even more specific and store the distance between the two colonies, no problem! Just add another field to Nearby_colonies to do that. Obviously the distance from colony a to colony b is the same in either direction ;P
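Since the order of the pair should not matter, reading the data back means checking both columns. The sketch below shows one way to do that with the colonies from the question, using Python's sqlite3 module instead of MySQL. The CHECK (a_id < b_id) rule and the link and nearby helpers are assumptions added here to keep each pair stored only once; MySQL versions before 8.0.16 parse CHECK constraints but do not enforce them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Colonies (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Nearby_colonies (
    a_id INTEGER NOT NULL,
    b_id INTEGER NOT NULL,
    CHECK (a_id < b_id),       -- store each pair once, smaller id first
    PRIMARY KEY (a_id, b_id)
);
""")
conn.executemany("INSERT INTO Colonies VALUES (?, ?)",
                 [(i, f"colony{i}") for i in (4, 5, 9, 10, 11)])

def link(x, y):
    """Record that colonies x and y are nearby, in canonical order."""
    conn.execute("INSERT INTO Nearby_colonies VALUES (?, ?)",
                 (min(x, y), max(x, y)))

for x, y in [(5, 4), (5, 9), (5, 10), (4, 9), (4, 10), (4, 11)]:
    link(x, y)

def nearby(colony_id):
    """Names of all colonies linked to colony_id, whichever column holds it."""
    rows = conn.execute("""
        SELECT c.name FROM Nearby_colonies n
        JOIN Colonies c
          ON c.id = CASE WHEN n.a_id = :id THEN n.b_id ELSE n.a_id END
        WHERE n.a_id = :id OR n.b_id = :id
        ORDER BY c.id
    """, {"id": colony_id})
    return [name for (name,) in rows]
```

A distance column could be added to Nearby_colonies without changing either helper's lookup logic.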
I guess that you already have the answer, but it's important to understand the idea behind the solution so that in the future you can do it yourself.
You have a class named colonies that is related to itself (e.g. "a colony is nearby multiple colonies") in a many-to-many relationship. That's called an NxN recursive relationship. In a simple UML diagram it'd be something like
Now that you have the object model, it's easier to create the db tables. An NxN relationship can be implemented as an intermediary table that contains both ids as the primary key. In this case we'll need a table that I'll call tbl_nearby
which is basically #tandu's answer, but I prefer using foreign keys because they preserve data integrity.
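A possible shape for tbl_nearby with foreign keys is sketched below on Python's sqlite3 module rather than MySQL. Only the table name tbl_nearby comes from the answer; tbl_colony, the column names and the ON DELETE CASCADE behavior are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE tbl_colony (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tbl_nearby (
    colony_id INTEGER NOT NULL REFERENCES tbl_colony(id) ON DELETE CASCADE,
    nearby_id INTEGER NOT NULL REFERENCES tbl_colony(id) ON DELETE CASCADE,
    PRIMARY KEY (colony_id, nearby_id)   -- both ids form the key
);
INSERT INTO tbl_colony VALUES (1, 'colony1'), (2, 'colony2');
INSERT INTO tbl_nearby VALUES (1, 2);
""")

# A link to a colony that does not exist is refused by the foreign key.
try:
    conn.execute("INSERT INTO tbl_nearby VALUES (1, 99)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

# Deleting a colony removes its links as well, so none are left dangling.
conn.execute("DELETE FROM tbl_colony WHERE id = 2")
links_left = conn.execute("SELECT COUNT(*) FROM tbl_nearby").fetchone()[0]
```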
It's a good idea to start with a UML object model and then translate that model into database tables, because that makes it easier to design really complex models.
Good luck
You need two tables. The first will hold the colony specifications (id, name, ...), the second will hold links between colonies. In the first table you will have one row per colony.
Ex:
Id Name
1 colony1
2 colony2
3 colony3
4 colony4
In the second table, you will have a column for the id of the colony, a column for the id of the nearby colony. In this table, you will have one row per colony neighbour.
Ex (here colony1 has colony 2 and 3 as nearby colonies):
ColonyId NeighbourId
1 2
1 3
You can use foreign key to ensure that the colony referenced in table 2 does exist in table 1.