For this example, I'm trying to build a system that will allow output from multiple sources, but these sources are not yet built. The output "module" will be one component, and each source will be its own component to be built and expanded upon later.
Here's an example I designed in MySQL Workbench:
The goal is to make my output module display data from the output table while being easily expanded upon later as more sources are built. I also want to minimize schema updates when adding new sources. Currently, I will have to add a new table per source, then add a foreign key to the output table.
Is there a better way to do this? I don't know how I feel about these NULL-able foreign keys because the JOIN query will contain IFNULLs and will get unruly quickly.
Thoughts?
EDIT 1: Clarification
I will be displaying a grid using data in the output table. The output table will contain general data for all items in the grid and will basically act as an aggregator for the output_source_X tables:
output(id, when_added, is_approved, when_approved, sort_order, ...)
The output_source_X tables will contain additional data specific to a source. For example, let's say one of the output source tables is for Facebook posts, so this table will contain columns specific to the Facebook API:
output_source_facebook(id, from, message, place, updated_time, ...)
Another may be Twitter, so the columns are specific for Twitter:
output_source_twitter(id, coordinates, favorited, truncated, text, ...)
A third output source table could be Instagram, so the output_source_instagram table will contain columns specific to Instagram.
There will be a one-to-one foreign key relationship with the output table and ONLY ONE of the output_source_X tables, depending on if the output item is a Facebook, Twitter, Instagram, etc... post, hence the NULL-able foreign keys.
output table
------------
foreign key (source_id_facebook) references output_source_facebook(id)
foreign key (source_id_twitter) references output_source_twitter(id)
foreign key (source_id_instagram) references output_source_instagram(id)
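In DDL terms, the current design looks roughly like this (column lists abbreviated and types guessed; it assumes the three output_source_X tables already exist):

CREATE TABLE output (
  id                  INT PRIMARY KEY AUTO_INCREMENT,
  when_added          DATETIME NOT NULL,
  is_approved         TINYINT(1) NOT NULL DEFAULT 0,
  -- by convention exactly one of the following is non-NULL,
  -- but nothing below actually enforces that
  source_id_facebook  INT NULL,
  source_id_twitter   INT NULL,
  source_id_instagram INT NULL,
  FOREIGN KEY (source_id_facebook)  REFERENCES output_source_facebook(id),
  FOREIGN KEY (source_id_twitter)   REFERENCES output_source_twitter(id),
  FOREIGN KEY (source_id_instagram) REFERENCES output_source_instagram(id)
);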
I guess my concern is that this is not as modular as I'd like it to be because I'd like to add other sources as well without having to update the schema much. Currently, this requires me to join output_source_X to the output table using whichever foreign key is not null.
This design is almost certainly bad in a few ways.
It's not that clear what your design is representing but a straightforward one would be something like:
// source [id] has ...
source(id,message,...)
// output [id] is approved when [approved]=1 and ...
output(id,approved,...)
// output [output_id] has [source_id] as a source
output_source(output_id,source_id)
foreign key (output_id) references output(id)
foreign key (source_id) references source(id)
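A minimal MySQL rendering of that shape, in case it helps (column lists abbreviated; nothing here beyond what the comments above say):

CREATE TABLE source (
  id      INT PRIMARY KEY,
  message TEXT
  -- ...
);

CREATE TABLE output (
  id       INT PRIMARY KEY,
  approved TINYINT(1) NOT NULL DEFAULT 0
  -- ...
);

-- output [output_id] has [source_id] as a source
CREATE TABLE output_source (
  output_id INT NOT NULL,
  source_id INT NOT NULL,
  PRIMARY KEY (output_id, source_id),
  FOREIGN KEY (output_id) REFERENCES output(id),
  FOREIGN KEY (source_id) REFERENCES source(id)
);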
Maybe you have different subtypes of outputs and/or sources? Based on sources and/or outputs? Maybe each source is restricted to feeding particular outputs? Are "outputs" and "sources" actually kinds of outputs and sources, and is this information not about how outputs are sourced but about what kinds of output-source pairings are permitted?
Please give us statements parameterized by column names for the basic statements you want to make about your application. I.e. for the application relationships you are interested in. (E.g. like the code comments above.) (You could do it for the diagrammed design, but that would probably be overly complicated and not really reflect what you are trying to model.)
Re your EDIT:
There will be a one-to-one foreign key relationship with the output
table and ONLY ONE of the output_source_X tables, depending on if the
output item is a Facebook, Twitter, Instagram, etc... post, hence the
NULL-able foreign keys.
You have a case of multiple disjoint subtypes of a supertype.
Your situation is a lot like that of this question, except that where they have a subtype discriminator/tag column indicating which subtype table applies, you have a set of columns where the non-empty one indicates which subtype table applies. See Erwin Smout's & my answers there. Also this answer.
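To make the subtype shape concrete, here is a hedged sketch (table names reused from your edit; the discriminator column, its values and the column types are my own choices, and the Facebook "from" column is renamed because FROM is a reserved word):

-- supertype: one row per output item, with a discriminator saying which kind of source it has
CREATE TABLE output (
  id          INT PRIMARY KEY AUTO_INCREMENT,
  source_type VARCHAR(20) NOT NULL,  -- e.g. 'facebook', 'twitter', 'instagram'
  when_added  DATETIME NOT NULL
  -- ... general columns shared by every source
);

-- one subtype table per source; its primary key is also a foreign key to the supertype
CREATE TABLE output_source_facebook (
  output_id INT PRIMARY KEY,
  from_user VARCHAR(255),
  message   TEXT,
  FOREIGN KEY (output_id) REFERENCES output(id)
);

CREATE TABLE output_source_twitter (
  output_id   INT PRIMARY KEY,
  coordinates VARCHAR(255),
  favorited   TINYINT(1),
  text        TEXT,
  FOREIGN KEY (output_id) REFERENCES output(id)
);

Adding a new source then means one new subtype table; the output table, its existing rows and its existing queries don't change.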
Please give us statements parameterized by column names for the basic
statements you want to make about your application
and you will find straightforward statements (as above). And if you give the statements for your current design you will find them complex. See also this.
I guess my concern is that this is not as modular as I'd like it to be
because I'd like to add other sources as well without having to update
the schema much.
Your structure is not reducing schema changes compared to proper subtype designs.
Anyway, DDL is there for that. You can genericize subtypes to avoid DDL only by giving up DBMS-managed integrity. That would only be relevant or reasonable based on evaluating DDL vs DML performance tradeoffs. Search re the (usually anti-pattern) EAV.
(Only after you have shown that creating & deleting new tables is infeasible, and that the corresponding horribly integrity- & concurrency-challenged, mega-joining, table-and-metadata-encoded-in-a-table, information-equivalent EAV design is feasible, should you consider using EAV.)
Related
I can't find a term for what I'm trying to do so that may be limiting my ability to find info related to my question.
I'm trying to relate product identifiers and product processing codes (orange table in fig.) with validation against what product types and subtypes are valid for each process code based on process type. Importantly, each product identifier is related to a product type (see ProductIdentifier table) and each process code is related to process type (see ProcessCode table). I minimized the attributes in the tables below to only those necessary for my question.
In the above example, when I INSERT INTO the RunProcessTypeOne table, I need to validate that the ProductCode for RoleOneProductIdentifier is present in ProductTypeTwo. Similarly, I need to validate that the ProductCode for RoleTwoProductIdentifier is present in ProductSubtypeOne.
Of course I can use a stored procedure that inserts into the RunProcessTypeOne table after running SELECT to check for the presence of the ProductCode related to RoleOneProductIdentifier and RoleTwoProductIdentifier in the relevant tables. This doesn't seem optimal since I'm having to run three SELECTs for every INSERT. Plus, it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign key).
Are there alternatives to this approach? Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
If more details are helpful: The relationship between ProductCode and ProcessCode is many-to-many, but there are rules that define product roles in each process, and only certain product types or subtypes may fulfill those roles. ProductTypeOne might include attributes that define a specific kind of product, like color or shape. ProductIdentifier includes the many lots of any ProductCode that are manufactured. ProcessCode includes settings that are put on a machine for processing. ProductType, by way of ProductCode, determines whether a ProductIdentifier is valid for a particular ProcessType. Individual ProcessCodes don't discriminate valid ProductIdentifiers; only the ProcessType related to the ProcessCode would discriminate.
it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign key).
Yes, that's an important observation; good to see you questioning the current schema. The fact of the matter is that SQL is not very powerful when it comes to representing data structures, so often a stored procedure is the only/least-worst approach.
I'll make a suggestion for how to achieve this without stored procedures, but I won't call it "optimal": there's likely to be a performance hit for INSERTs (and worse for UPDATEs), because the SQL engine will probably be in effect carrying out the same SELECTs as you'd code in a stored procedure.
Split table ProductIdentifier into two:
ProductIdentifierTypeTwo PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeTwo.ProductCode.
ProductIdentifierTypeOne PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeOne.ProductCode.
Also CREATE VIEW ProductIdentifier UNION the two sub-tables, PK ProductIdentifier. This makes sure ProductIdentifier isn't duplicated between the two types.
IOW this avoids the ProductIdentifier table directly referencing the ProductCode table, where it can only examine ProductType as a column value, not as a referential structure.
Then
RunProcessTypeOne.RoleOneProductIdentifier FK REFERENCES ProductIdentifierTypeTwo.ProductIdentifier.
RunProcessTypeOne.RoleTwoProductIdentifier FK REFERENCES ProductIdentifierTypeOne.ProductIdentifier.
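A hedged MySQL sketch of that split (column types are invented, and it assumes ProductCode is a key of ProductTypeOne/ProductTypeTwo so it can be referenced):

CREATE TABLE ProductIdentifierTypeTwo (
  ProductIdentifier VARCHAR(50) PRIMARY KEY,
  ProductCode       VARCHAR(50) NOT NULL,
  FOREIGN KEY (ProductCode) REFERENCES ProductTypeTwo (ProductCode)
);

CREATE TABLE ProductIdentifierTypeOne (
  ProductIdentifier VARCHAR(50) PRIMARY KEY,
  ProductCode       VARCHAR(50) NOT NULL,
  FOREIGN KEY (ProductCode) REFERENCES ProductTypeOne (ProductCode)
);

-- union view standing in for the original ProductIdentifier table
-- (MySQL views can't declare a primary key, so a duplicate identifier
-- appearing in both sub-tables would still need a separate check)
CREATE VIEW ProductIdentifier AS
  SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeTwo
  UNION ALL
  SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeOne;

CREATE TABLE RunProcessTypeOne (
  RunId                    INT PRIMARY KEY AUTO_INCREMENT,
  RoleOneProductIdentifier VARCHAR(50) NOT NULL,
  RoleTwoProductIdentifier VARCHAR(50) NOT NULL,
  FOREIGN KEY (RoleOneProductIdentifier)
    REFERENCES ProductIdentifierTypeTwo (ProductIdentifier),
  FOREIGN KEY (RoleTwoProductIdentifier)
    REFERENCES ProductIdentifierTypeOne (ProductIdentifier)
);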
Making the original ProductIdentifier a VIEW is the least non-optimal way to manage updates (I'm guessing from your comment): ProductIdentifiers are less volatile than RunProcesses.
Re your more general question:
Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
There are facilities included in the SQL standard. Most vendors haven't implemented them, or only partially support them -- essentially because implementing them would need running SELECTs with tricky logic as part of table updates.
You should be able to CREATE VIEW with a filter to only the rows that are the target of some FK.
(Your dba is likely to object that VIEWs come with an unacceptable performance hit. In this example, you'd have a single ProductIdentifier table, with the two sub-tables I suggest above as VIEWs. But maintaining those views would need joining to ProductCode to filter by ProductType.)
Then you should be able to define a FK to the VIEW rather than to the base table.
(This is the bit many SQL vendors don't support.)
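For illustration, under that single-table variant (names assumed), the filter views are plain MySQL; the last statement is the part most engines, MySQL included, will reject:

-- filter view keeping only the identifiers whose product code is of type two
CREATE VIEW ProductIdentifierTypeTwo AS
  SELECT pi.ProductIdentifier, pi.ProductCode
  FROM   ProductIdentifier pi
  JOIN   ProductTypeTwo pt ON pt.ProductCode = pi.ProductCode;

-- pointing a foreign key at that view is what the text above describes;
-- MySQL does not accept a foreign key that references a view, so this is
-- shown only to make the idea concrete:
-- ALTER TABLE RunProcessTypeOne
--   ADD FOREIGN KEY (RoleOneProductIdentifier)
--   REFERENCES ProductIdentifierTypeTwo (ProductIdentifier);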
I have a JSON file that stores the information about a bunch of recipes, like cuisine, time, the ingredients, instructions, etc. I am supposed to transfer all the data to a MySQL table with the relevant headings.
The "ingredients" and the "instructions" are stored like this:
The instructions and ingredients have several "lines", stored as a list.
How can we store the ingredients and instructions in a MySQL table, in a line by line format?
something like:
instructions
inst1
inst2
..
The JSON file was created by a Python program using the Beautiful Soup module.
PS: I am very new to both SQL and JSON, so I unfortunately don't have anything to show under "what I tried"... Any help will be appreciated.
Rather than give you the exact answer, I'll give you the process I use to determine a database structure. You're using a relational database, so that's what I'll talk about. It's also good to have a naming convention; I've used CamelCase here, but you can do whatever you want.
You mentioned you were using Python, but this answer is language agnostic.
You've chosen quite a complex example, but I'll assume you understand how to create a table, and use primary keys and foreign keys. If not, maybe you should do something simpler.
Step 1 - Figure out what the entities are
These are the real-life entities which need to be represented as database tables. In this case, I'm seeing 4 entities:
Recipe
Keyword
Ingredient
Instruction
Each of these can have a table in MySQL. Give them a primary key which follows a naming convention.
Step 2 - Figure out the relationships
It looks like keywords are shared between multiple recipes, so you'll need a many-to-many relationship - this means there's going to be an extra table:
RecipeKeyword
This is just a link between Recipe and Keyword to avoid redundancy. It has two foreign keys, RecipeId and KeywordId. At the moment it's just a dumb object. In other situations like this, it's common for an application to need information about a join - for example, who linked the two things together (consider users, permissions, and a join table with information on who granted the permission).
The other entities are one-to-many - each will need a foreign key, RecipeId.
Step 3 - Design each table
As well as having several lists, your Recipe object has some properties. These can be columns in its table. Most of them are strings in your data; although there are better ways to store some of them, we can keep this simple.
The other entities just have a text field - from your screenshot, only the Recipe has properties of its own.
For this system, you'll need to first insert all Recipe and Keyword objects. There is a common pattern in relational databases where you insert a record and get its ID so you can insert more records which reference it.
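If it helps to see the shape (these column names are only placeholders, not the "exact answer"), the tables from steps 1-3 could look something like this:

CREATE TABLE Recipe (
  RecipeId    INT PRIMARY KEY AUTO_INCREMENT,
  Name        VARCHAR(255) NOT NULL,
  Cuisine     VARCHAR(100),
  TimeMinutes INT
  -- other simple recipe properties as columns
);

CREATE TABLE Keyword (
  KeywordId INT PRIMARY KEY AUTO_INCREMENT,
  Name      VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE RecipeKeyword (
  RecipeId  INT NOT NULL,
  KeywordId INT NOT NULL,
  PRIMARY KEY (RecipeId, KeywordId),
  FOREIGN KEY (RecipeId)  REFERENCES Recipe (RecipeId),
  FOREIGN KEY (KeywordId) REFERENCES Keyword (KeywordId)
);

CREATE TABLE Ingredient (
  IngredientId INT PRIMARY KEY AUTO_INCREMENT,
  RecipeId     INT NOT NULL,
  Line         TEXT NOT NULL,
  FOREIGN KEY (RecipeId) REFERENCES Recipe (RecipeId)
);

CREATE TABLE Instruction (
  InstructionId INT PRIMARY KEY AUTO_INCREMENT,
  RecipeId      INT NOT NULL,
  StepNumber    INT NOT NULL,
  Line          TEXT NOT NULL,
  FOREIGN KEY (RecipeId) REFERENCES Recipe (RecipeId)
);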
Step 4 - Find a Python MySQL library
I don't know of one, but Google will help you find it. The documentation should include the basics of querying.
Step 5 - Insert your data
Here is some pseudocode:
FOR EACH recipe
    INSERT the recipe, and get its ID
    FOR EACH keyword
        IF the keyword does not exist already
            INSERT the new keyword and get its ID
        INSERT a record into RecipeKeyword with RecipeId and KeywordId
    FOR EACH ingredient
        INSERT the ingredient, give it RecipeId as a foreign key
    FOR EACH instruction
        INSERT the instruction, give it RecipeId as a foreign key
That's it. From here you can SELECT with JOINs - to reproduce the original structure we're seeing above, you might need to do 3 separate queries and merge them together into a record object on the Python side.
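For example, pulling one recipe back out might be three queries along these lines (using the placeholder names from the sketch above):

-- the recipe row itself
SELECT * FROM Recipe WHERE RecipeId = 1;

-- its keywords, via the join table
SELECT k.Name
FROM   RecipeKeyword rk
JOIN   Keyword k ON k.KeywordId = rk.KeywordId
WHERE  rk.RecipeId = 1;

-- its ingredient lines (instructions work the same way)
SELECT Line FROM Ingredient WHERE RecipeId = 1;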
I have one question regarding database design.
Here is the first example:
A user may have multiple Websites, and a user can request a specific resource for each of his websites. All requests are saved in the RequestForResource table.
Now, if I want to see the name of the user who requested a resource, I have to join the RequestForResource, Website, and User tables.
To avoid this, I can add a foreign key between RequestForResource and the User table, as demonstrated here:
Now, in order to get the user name, I only have to join RequestForResource and User, which is probably easier for the SQL server, but on the other hand I have one more foreign key.
Which approach is better and/or faster, and why?
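To make it concrete, the two lookups would be roughly the following (column names are my guesses from the diagram):

-- first design: go through Website to reach the user
SELECT u.Name
FROM   RequestForResource r
JOIN   Website w ON w.idWebsite = r.Website_idWebsite
JOIN   User u    ON u.idUser    = w.User_idUser
WHERE  r.idRequestForResource = 1;

-- second design: the extra foreign key lets me skip Website
SELECT u.Name
FROM   RequestForResource r
JOIN   User u ON u.idUser = r.User_idUser
WHERE  r.idRequestForResource = 1;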
You can always duplicate information to gain execution speed. This is called denormalisation. Yes, it will probably speed up the queries by lowering the required number of index seeks.
BUT
You have to write your code to make sure that the data stays consistent:
With the second design it is possible to insert a Website.User_idUser and a RequestForResource.User_idUser with different IDs for the same site! According to the design this is valid (but it probably won't satisfy your business rules).
Consider updating the foreign key constraint (or adding a second one) so that it refers only to the Website table on (User_idUser, Website_idWebsite), and removing the User-RequestForResource one.
Also consider building a view to query your data with all the required info (probably with a clustered index).
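A hedged sketch of both suggestions, reusing the column names from the diagram (and assuming Website's primary key is idWebsite):

-- make the (website, owner) pair referencable
ALTER TABLE Website
  ADD UNIQUE KEY uq_website_user (idWebsite, User_idUser);

-- RequestForResource keeps its own User_idUser column, but the composite
-- foreign key guarantees it matches the owner of the referenced website
ALTER TABLE RequestForResource
  ADD FOREIGN KEY (Website_idWebsite, User_idUser)
  REFERENCES Website (idWebsite, User_idUser);

-- convenience view with the user name already joined in
CREATE VIEW RequestForResourceWithUser AS
  SELECT r.*, u.Name AS UserName
  FROM   RequestForResource r
  JOIN   User u ON u.idUser = r.User_idUser;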
We are currently in the process of developing our own e-commerce solution. As part of our research we have been examining the ZenCart database schema and found that data is quite frequently duplicated between various tables where it would seem that a foreign key would have been sufficient to link the two or more tables in question. For example:
Given that there is a table "Products" that has the following columns:
PRODUCT_ID
PRODUCT_NAME
PRODUCT_PRICE
PRODUCT_SKU
Then if there is a Sales_Item table, of course a product (and all its constituent columns) may be referenced by simply doing something like:
SALES_ITEM_ID
Products_PRODUCT_ID // This is the foreign key that relates a specific product to a sale item.
SALE_TIME
REST_OF_SALE_SPECIFIC_DATA
...
However, instead it seems that the Sales table copies many of the field values defined in the Products table, so it in fact looks as follows:
SALES_ITEM_ID
PRODUCT_ID
PRODUCT_NAME
PRODUCT_PRICE
PRODUCT_SKU
SALE_TIME
My question is: which approach would generally be considered best practice when attempting to build a scalable, efficient solution? Using foreign keys means data is not duplicated, but the caveat is that database- or application-level JOINs are needed in order to query the entire dataset. That being said, for some reason the foreign key approach seems cleaner and more correct somehow.
I have done quite a lot of research, and I believe that my database is in 4NF (I was told there is no need to go any further), but something still feels wrong.
I have a table TRUNK to which two tables refer via a foreign key: RATECARD, as one trunk can be used in many ratecards (the differentiation being times when valid, call plans, etc.); and RATEBUYINGINFO, which is basically info that you download from the trunk providers and contains info on rates to different destinations and similar. Obviously more RATEBUYINGINFO rows can be associated with one trunk as the prices change over time, but RATEBUYINGINFO and RATECARD have no direct connection except that they may refer to a single trunk, so I have TrunkID as a foreign key in both these tables.
Then I have the selling rates (RATESELLINGINFO table), which are based on a certain RATECARD and also on the destination info together with the trunk info, all of which is kept track of in the RATEBUYINGINFO table (and no, I don't see the point in singling out DESTINATION as a separate table, as different trunks by different providers do not provide unique destination names), so I have RateCardID and RateBuyingInfoID as foreign keys in the RATESELLINGINFO table.
Now the problem is that via these two foreign keys the last table has access to two TrunkID values (one in RATECARD and one in RATEBUYINGINFO) which should always be the same (obviously one selling rate refers to a single trunk), but the database architecture won't guarantee that in any way.
Is there an elegant solution to this problem?
When you ask questions like this, always include SQL CREATE TABLE statements and some sample data as SQL INSERT statements. SQL is much more reliable and less ambiguous than your comments. (You can edit your question and add that stuff now to get better answers from people who read this later.)
The trunk id in the tables RATECARD and RATEBUYINGINFO should probably be part of the primary key, or part of a unique constraint, in both those tables. If it is, then you can store the trunk id once in RATESELLINGINFO with overlapping foreign key constraints. Something like:
...
foreign key (trunk_id, rate_card_id)
references ratecard (trunk_id, rate_card_id),
foreign key (trunk_id, rate_buying_info_id)
references rate_buying_info (trunk_id, rate_buying_info_id)
...
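Spelled out a little more (column types and the surrounding TRUNK table are assumed; the point is only the composite keys and the overlapping foreign keys):

CREATE TABLE trunk (
  trunk_id INT PRIMARY KEY
  -- ...
);

CREATE TABLE ratecard (
  trunk_id     INT NOT NULL,
  rate_card_id INT NOT NULL,
  PRIMARY KEY (trunk_id, rate_card_id),
  FOREIGN KEY (trunk_id) REFERENCES trunk (trunk_id)
);

CREATE TABLE rate_buying_info (
  trunk_id            INT NOT NULL,
  rate_buying_info_id INT NOT NULL,
  PRIMARY KEY (trunk_id, rate_buying_info_id),
  FOREIGN KEY (trunk_id) REFERENCES trunk (trunk_id)
);

CREATE TABLE rate_selling_info (
  trunk_id            INT NOT NULL,
  rate_card_id        INT NOT NULL,
  rate_buying_info_id INT NOT NULL,
  -- trunk_id takes part in both foreign keys, so the ratecard and the
  -- buying info are forced to belong to the same trunk
  FOREIGN KEY (trunk_id, rate_card_id)
    REFERENCES ratecard (trunk_id, rate_card_id),
  FOREIGN KEY (trunk_id, rate_buying_info_id)
    REFERENCES rate_buying_info (trunk_id, rate_buying_info_id)
);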
Trunk id would have ended up in RATESELLINGINFO anyway (probably) if you'd done a full relational model.
Additional tip: drop the word "info" from your table names. All tables contain info; adding that to the name is just noise.