How can I best maintain integrity between two columns in a table? - mysql

Hypothetically, I have an ENUM column named Category, and an ENUM column named Subcategory. I will sometimes want to SELECT on Category alone, which is why they are split out.
CREATE TABLE `Bonza` (
  `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Category` ENUM('a', 'b', 'c') NOT NULL,
  `Subcategory` ENUM('x', 'y', 'z') NOT NULL,
  PRIMARY KEY (`EventId`)
) ENGINE=InnoDB;
But not all subcategories are valid for all categories (say, "z" is only valid with "a" and "b"), and it irks me that this constraint isn't baked into the design of the table. If MySQL had some sort of "pair" type (where a column of that type were indexable on a leading subsequence of the value) then this wouldn't be such an issue.
I'm stuck with writing long conditionals in a trigger if I want to maintain integrity between category and subcategory. Or am I better off just leaving it? What would you do?
I suppose the most relationally-oriented approach would be storing an EventCategoryId instead, and mapping it to a table containing all valid event type pairs, and joining on that table every time I want to look up the meaning of an event category.
CREATE TABLE `Bonza` (
  `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `EventCategoryId` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`EventId`),
  FOREIGN KEY (`EventCategoryId`)
    REFERENCES `EventCategories` (`EventCategoryId`)
    ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
-- (EventCategories must exist before this FK can be created)
CREATE TABLE `EventCategories` (
  `EventCategoryId` INT UNSIGNED NOT NULL,
  `Category` ENUM('a', 'b', 'c') NOT NULL,
  `Subcategory` ENUM('x', 'y', 'z') NOT NULL,
  PRIMARY KEY (`EventCategoryId`)
) ENGINE=InnoDB;
-- Now populate this table with valid category/subcategory pairs at installation
Can I do anything simpler? This lookup will potentially cost me complexity and performance in calling code, for INSERTs into Bonza, no?

Assuming that your categories and subcategories don't change that often, and assuming that you're willing to live with a big update when they do, you can do the following:
Use an EventCategories table to control the hierarchical constraint between categories and subcategories. The primary key for that table should be a compound key containing both Category and Subcategory. Reference this table in your Bonza table. The foreign key in Bonza happens to contain both of the columns that you want to filter by, so you don't need to join to get what you're after. It will also be impossible to assign an invalid combination.
-- Create EventCategories first, so that Bonza's foreign key has something to reference:
CREATE TABLE `EventCategories` (
  `EventCategoryId` INT UNSIGNED NOT NULL,
  `Category` CHAR(1) NOT NULL,
  `Subcategory` CHAR(1) NOT NULL,
  PRIMARY KEY (`Category`, `Subcategory`)
) ENGINE=InnoDB;

CREATE TABLE `Bonza` (
  `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Category` CHAR(1) NOT NULL,
  `Subcategory` CHAR(1) NOT NULL,
  PRIMARY KEY (`EventId`),
  FOREIGN KEY (`Category`, `Subcategory`)
    REFERENCES `EventCategories` (`Category`, `Subcategory`)
    ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
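With this in place the engine itself rejects an invalid pair. A quick sketch (assuming ('c', 'z') was never inserted into EventCategories):
-- This INSERT fails with ERROR 1452 (23000): a foreign key constraint fails,
-- because no ('c', 'z') row exists in EventCategories:
INSERT INTO `Bonza` (`Category`, `Subcategory`) VALUES ('c', 'z');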

My thought is: "best" is almost always opinion-based, but there are still some general points worth making.
Using relational structure
Once not every pair is valid, you have information that must be stored: either which pairs are invalid or which pairs are valid. Your sample with an additional table is completely valid in relational terms; in fact, when facing such an issue, it is nearly the only way to resolve it at the database-design level. With it:
You store the valid pairs. As I said, this information has to live somewhere, and here it is: in a new table.
You maintain referential integrity via a FOREIGN KEY, so your data will always be correct and point to a valid pair.
What bad things may happen and how could this impact the performance?
To reconstruct the full row, you'll need a simple JOIN:
SELECT
  Bonza.EventId,
  EventCategories.Category,
  EventCategories.Subcategory
FROM
  Bonza
LEFT JOIN EventCategories
  ON Bonza.EventCategoryId = EventCategories.EventCategoryId
Performance of this JOIN will be good: the join goes through the FK, so by definition it's an indexed lookup. The exact speed depends on index quality (i.e. its cardinality), but in general it will be fast.
How complex is one JOIN? It's a simple operation, though it may add some overhead to already-complex queries. In my opinion that's fine; there's nothing difficult in it.
You are able to change the pairs simply by changing the data in EventCategories. That is, you can easily remove a restriction on a prohibited pair, and this affects nothing else; I see that as a great benefit of this structure. Adding a new restriction, however, isn't as simple, because it requires a DELETE. You've chosen the ON DELETE RESTRICT action for your FK, which means you'll have to handle all conflicting records before adding the new restriction. This depends, of course, on your application's logic, but think of it another way: if you add a new restriction, shouldn't all conflicting records be removed (because the logic says they should)? If so, change your FK to ON DELETE CASCADE, as sketched below.
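For example, a sketch against the question's EventCategories table (the pair chosen is illustrative):
-- With ON DELETE CASCADE on Bonza's FK, removing a pair from EventCategories
-- automatically deletes every Bonza row that referenced it:
DELETE FROM `EventCategories` WHERE `Category` = 'c' AND `Subcategory` = 'y';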
So: an additional table is a simple, flexible, and genuinely easy way to resolve your issue.
Storing in one table
You've mentioned that you can use a trigger for this, and that is indeed applicable, so I'll show that this approach has its weaknesses (along with some benefits). Let's say we create the trigger:
DELIMITER //
CREATE TRIGGER catCheck BEFORE INSERT ON Bonza
FOR EACH ROW
BEGIN
  IF NEW.Subcategory = 'z' AND NEW.Category = 'c' THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Invalid category pair';
  END IF;
END;//
DELIMITER ;
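A sketch of the trigger firing (1644 is MySQL's error code for an unhandled user-defined SIGNAL):
INSERT INTO Bonza (Category, Subcategory) VALUES ('c', 'z');
-- ERROR 1644 (45000): Invalid category pair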
Obviously, we still have to store the information needed to validate our pairs, but in this case we store the invalid combinations. When invalid data comes in, we catch it inside the trigger and abort the insert, signalling the user-defined SQLSTATE 45000 together with some explanatory text. Now, what about complexity and performance?
This approach lets you store your data as-is, in one table. That's a benefit: you get rid of the JOIN, and integrity is maintained by another tool. You can forget about storing pairs and handling them, hiding that logic in the trigger.
So you win on SELECT statements: your data always contains valid pairs, and no JOIN is needed.
But you lose on INSERT/UPDATE statements: they invoke the trigger and, within it, some checking condition. It may be complex (many IF branches), and MySQL will check them one by one. Collapsing everything into a single condition wouldn't help much, because in the worst case MySQL still has to evaluate it to the end.
Scalability of this method is poor. Every time you need to add or remove a pair restriction, you have to redefine the trigger. Worse, unlike the JOIN case, you cannot use cascading actions; you'd have to handle everything manually.
What to choose?
For the common case, when you don't know for certain what your application's conditions will be, I recommend the JOIN option. It's simple, readable, and scalable, and it fits relational DB principles.
For some special cases, you may want to choose the second option. Those conditions would be:
Allowed pairs will never change (or will change very rarely).
SELECT statements are issued much, much more often than INSERT/UPDATE statements, and SELECT performance is the highest priority for your application.

I like this problem, but with this information I would define the set of valid pairs in just one ENUM column:
CategorySubcategory ENUM('ax', 'ay', 'az', 'bx', 'by', 'bz', 'cx', 'cy')
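Selecting on the category alone, as the question requires, still works because the category is the leading character of each pair. A sketch (table name from the question, column from this answer):
-- Filter on category alone via the leading character of the pair:
SELECT * FROM Bonza WHERE CategorySubcategory IN ('ax', 'ay', 'az');
-- or, relying on ENUM-to-string comparison:
SELECT * FROM Bonza WHERE CategorySubcategory LIKE 'a%';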
I think this will only be useful with a limited set of values; when they get bigger, I would personally choose your second option rather than the trigger:
The first reason is purely an opinion: I don't like triggers too much, and they don't like me.
The second reason is that a well-indexed and properly sized reference from one table to another performs really well.

Related

Is there a data type in MySQL that is similar to a dynamic array in C or C++?

I want to structure a MySQL table with many of the usual columns, such as index as an integer, name as a varchar, etc. The thing about this table is that I want to include a column that has an unknown number of entries. I think the best way to do this (if possible) is to make one of the columns an array that can be changed like any entry in a database can. Suppose that when the record is created it has 0 entries. Then later, I want to add 1 or more. Maybe sometime later still, I might want to remove 1 or more of these entries.
I know I could create the table with individual columns for each of these additions, but I may want as many as a hundred or more for one record. This seems very inefficient and very difficult to maintain. So the bottom-line question is can a column be defined as a dynamic array? If so, how? How can things be selectively added to or removed from it?
I'll take a stab in the dark and guess: maybe make a table contain another table? I've never heard of this because my experience with MySQL has been mostly casual. I make databases and dynamic websites because I want to.
The way to do this in a relational database is to create another table. One column of that table will have a foreign key pointing to the primary key of the table that should have had the array (or multiple columns, if that primary key consists of more than one column). Another column holds the values that would be found in the array. If order matters, a third column stores some value indicating the ordinality.
Something along the lines of:
CREATE TABLE elbat_array
       (id integer,
        elbat integer -- or whatever type the primary key column has
              NOT NULL,
        value text, -- or whatever type the values should have
        ordinality integer
                   NOT NULL, -- optional
        PRIMARY KEY (id),
        FOREIGN KEY (elbat)
                REFERENCES elbat -- the other table
                           (id) -- and its primary key column
                ON DELETE CASCADE,
        UNIQUE (elbat, ordinality)); -- ordering is unique per parent row, not globally
To add to the "array", insert rows into that table. To remove, delete rows. There can be as few as zero rows (i.e. "array" elements) or as many as disk space allows (unless you hit a limit of the DBMS first, but such limits are very large, so usually that should not be a problem).
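A hypothetical usage sketch, assuming a parent row in elbat with id = 1:
-- append two "array" elements belonging to elbat row 1:
INSERT INTO elbat_array (id, elbat, value, ordinality) VALUES (1, 1, 'first', 1);
INSERT INTO elbat_array (id, elbat, value, ordinality) VALUES (2, 1, 'second', 2);
-- remove the second element again:
DELETE FROM elbat_array WHERE elbat = 1 AND ordinality = 2;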
Also worth a read in that context: "Is storing a delimited list in a database column really that bad?" While it's not about an array type in particular, on the meta level it discusses why the values in a column should be atomic. An array would violate that as well as a delimited list does.

Auto-increment a primary key in MySQL

When creating tables with MySQL in phpMyAdmin, I always run into an issue with primary keys and their auto-increments. When I insert rows into my table, AUTO_INCREMENT works perfectly, adding 1 to the primary key of each new row. But when I delete a row, for example one where the primary key is 'id = 4', and then add a new row, the new row gets 'id = 5' instead of 'id = 4'. It acts like the old row was never deleted.
Here is an example of the SQL statement:
CREATE TABLE employe(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(30) NOT NULL
)
ENGINE = INNODB;
How do I solve this problem?
Thank you.
I'm pretty sure this is by design. If you had IDs up to 6 in your table and you deleted ID 2, would you want the next insert to receive an ID of 2? AUTO_INCREMENT is designed to hand out monotonically increasing values, not to fill gaps. Also, if there was a dependence on that data, for example if the IDs identified users, reuse would invalidate pre-existing information: if user X was deleted and the same ID was assigned to user Y, that could cause integrity issues in dependent systems.
Also, imagine a table with 50 billion rows. Should the table run an O(n) search for the smallest missing ID every time you're trying to insert a new record? I can see that getting out of hand really quickly.
Some links you might like to read:
Principles of Transaction-Oriented Database Recovery (1983)
How can we re-use the deleted id from any MySQL-DB table?
Why do you care?
Primary keys are internal row identifiers that are not supposed to be sexy or good looking. As long as they identify each row uniquely, they serve their purpose.
Now, if you care about its value, then you probably want to expose the primary key value somewhere, and that's a big red flag. If you need an external, visible identifier, you can create a secondary column with any formatting sequence and values you want.
As a side note, the term AUTO_INCREMENT is a bit misleading. It doesn't mean values increase one by one all the time; it just means MySQL will try to produce sequential numbers as long as that is possible. In multi-threaded apps that's usually not possible, since batches of numbers are reserved per thread, so the actual insertion order may not follow the natural numbering. Row deletions have a similar effect, as do INSERTs that get rolled back.
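A sketch of the behaviour from the question, using its employe table (values illustrative):
INSERT INTO employe (name) VALUES ('a'), ('b'), ('c'), ('d'); -- ids 1, 2, 3, 4
DELETE FROM employe WHERE id = 4;
INSERT INTO employe (name) VALUES ('e'); -- gets id 5; the counter does not rewind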
Primary keys are meant to be used for joining tables together and
indexing; they are not meant for human usage. Reordering
primary key columns could orphan data and wreak havoc on your queries.
Tip: add another column to your table and reorder that column at will if needed (show that column to your user instead of the primary key).

Issues with a circular table reference MySQL

Not really a DBA, but was tasked with designing a couple new tables for a new feature in a web app. DB is MySQL, using NHibernate as ORM (though that's probably irrelevant to the question).
I'm going to be modelling various "scenarios" which represent different variations of several designs in the app. Aside from the first scenario and "unstarted" scenarios, each scenario will have a parent scenario it builds from. As a result, we'll end up with a sort of "no-loop / no-merge" tree structure as scenarios are branched from one another.
CREATE TABLE `scenarios` (
`ScenarioID` INT NOT NULL AUTO_INCREMENT,
`DesignID` INT DEFAULT NULL,
`ScenarioTreeID` INT NOT NULL,
`ParentScenarioID` INT DEFAULT NULL,
`Title` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
...
In addition to the scenarios themselves, there's information that's best related to the entire "Tree" of scenarios (e.g. what structure are the scenarios related to, etc). I've tried to factor this data out into another table called scenariotree and reference it from scenarios via ScenarioTreeID. The issue I ran into was, from a querying perspective, that it'd be important to know what the "root scenario" is when I query the tree (I can't just go WHERE ParentScenarioID is NULL as that includes "unstarted" scenarios). So I tried to set up the table as such:
CREATE TABLE `scenariotree` (
`ScenarioTreeID` INT NOT NULL AUTO_INCREMENT,
`StructureID` INT NOT NULL,
`RootScenario` INT DEFAULT NULL,
...
But then I couldn't create either table due to the circular foreign key references. I realise I can create the tables first & then add the foreign keys in (or just turn FK checks off & then on again when I'm finished), but should I be doing this? Poking around online I'm finding conflicting opinions. Basically what I want to ask is:
"Is this acceptable schema design, or am I going to run into issues down the road? If so, what issues am I likely to have & how might I restructure these tables to avoid them?"
It's fine to have circular references. They are less common than schemas without cycles, but they are a legitimate way to model some data structures.
They do require some special handling, as you discovered. That's okay and it's necessary.
You already identified two ways of handling them:
SET FOREIGN_KEY_CHECKS=0; temporarily while you insert the mutually-dependent data. One problem with this is that some people forget to re-enable the checks, and then weeks later discover that their data is full of references pointing to non-existent rows.
Create the tables first, then use ALTER TABLE to add the foreign keys after you have populated the data. The problem here is that if you need to add new rows to existing tables, you'd have to drop the foreign keys and re-add them every time, and this affects all clients, not just your session.
A couple of other options:
Make one or the other foreign key nullable. When you need to insert mutually-dependent rows into the two tables, insert the row with the nullable FK first, using NULL. Then insert into the other table. Then UPDATE the first table to assign the non-NULL value it should reference (see the sketch after this list).
Finally, don't use FOREIGN KEY constraints. You will have columns that reference other columns, but it's sort of on the "honor system" instead of an RDBMS-enforced constraint. This comes with its own risks, of course, because data that is supposed to be a foreign key has no assurance that it is correct. But it gives you total freedom to insert in whatever order you need. You can use a transaction to make sure inserts to both tables happen together.
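A sketch of the nullable-FK option against the question's tables (the StructureID value and the Title are illustrative):
START TRANSACTION;
-- 1. Insert the tree with no root scenario yet:
INSERT INTO scenariotree (StructureID, RootScenario) VALUES (42, NULL);
SET @tree_id = LAST_INSERT_ID();
-- 2. Insert the root scenario, pointing at the tree:
INSERT INTO scenarios (ScenarioTreeID, ParentScenarioID, Title)
  VALUES (@tree_id, NULL, 'Initial design');
-- 3. Close the circle:
UPDATE scenariotree SET RootScenario = LAST_INSERT_ID() WHERE ScenarioTreeID = @tree_id;
COMMIT;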

Database Structure for Inconsistent Data

I am creating a database for my company that will store many different types of information. The categories are Brightness, Contrast, Chromaticity, etc. Each category has a number of data points which my company would like to start storing.
Normally, I would create a table for each category to store the corresponding data (this is how I learned to do it). However, sometimes these categories have "sub-data" which changes the number of fields required in each table.
My question, then, is how people handle inconsistent data when structuring their databases. Do they just keep adding more tables for the extra data, or is it something else altogether?
There are a few (and thank goodness only a few) unbendable rules about relational database models. One of them is that if you don't know what to store, you'll have a hard time storing it. Chances are you'll have an even harder time retrieving it.
That said, the reality of business rules is often less clear cut than the ivory tower of database design. Most importantly, you might want or even need a way to introduce a new property without changing the schema.
Here are two feasible ways to go at this:
Use a datastore that specializes in loose or nonexistent schemas (NoSQL and friends). Explaining this in detail is the subject of a CS thesis, not a stackoverflow answer.
My recommendation: use a separate properties table. Here is how this goes:
Assuming for the sake of argument that your products always have a (unique string) name, an (integer) id, brightness, contrast, and chromaticity, plus sometimes an (integer) foo and a (string) bar, consider these tables:
CREATE TABLE products (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(50) NOT NULL,
  brightness INT,
  contrast INT,
  chromaticity INT,
  UNIQUE INDEX(name)
);
CREATE TABLE properties (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(50) NOT NULL,
  proptype ENUM('null','int','string') NOT NULL DEFAULT 'null',
  UNIQUE INDEX(name)
);
INSERT INTO properties VALUES
  (NULL,'foo','int'),
  (NULL,'bar','string');
CREATE TABLE product_properties (
  id INT PRIMARY KEY AUTO_INCREMENT,
  products_id INT NOT NULL,
  properties_id INT NOT NULL,
  intvalue INT,             -- NULL unless the property holds an int
  stringvalue VARCHAR(250), -- NULL unless the property holds a string
  UNIQUE INDEX(products_id, properties_id)
);
Now your "standard" properties live in the products table as usual, while the "optional" properties are stored in rows of product_properties that reference the product id and the property id, with the value in intvalue or stringvalue.
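For example, a sketch of attaching a foo to a product (assuming that product has id 7, and that 'foo' received properties.id = 1 from the INSERT above):
-- store foo = 42 for product 7; stringvalue stays NULL:
INSERT INTO product_properties (products_id, properties_id, intvalue)
VALUES (7, 1, 42);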
Selecting products, including their foo if any, would look like:
SELECT
  products.*,
  product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
  ON products.id = product_properties.products_id
  AND product_properties.properties_id = 1
or even
SELECT
  products.*,
  product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
  ON products.id = product_properties.products_id
LEFT JOIN properties
  ON product_properties.properties_id = properties.id
WHERE properties.name = 'foo' OR properties.name IS NULL
Please understand that this incurs a performance penalty; in effect you trade performance for flexibility: adding another property is nothing more than INSERTing a row into properties, and the schema stays the same.
If you're not MySQL-bound, other databases have table inheritance or arrays to solve some of these niche cases. PostgreSQL is a very nice database that you can use as easily and freely as MySQL.
With MySQL you could:
Change your tables: add the extra columns and allow NULL in the subcategory columns you don't need. That way integrity can still be checked, since you can still put constraints on the columns. Unless you really have a lot of subcategory columns, I'd recommend this; otherwise, option 3.
Store subcategory data dynamically in a separate table with a category_id, a category_row_id, a subcategory identifier (the type of subcategory), and a value column. That way you can retrieve your data by linking via the category_id (which determines the table) and the category_row_id (which links to the PK of the original category-table row). The bad part: you can't properly use foreign keys or constraints to enforce integrity, so you'd need to write hairy insert/update triggers to keep some control, which pushes the burden of integrity and referential checking entirely onto the client (in which case you'd probably be better off going the NoSQL route). In short, I wouldn't recommend this.
Make a separate subcategory table per category table. Columns can be fixed or variable via value column(s) plus an optional subcategory identifier; foreign keys can still be used, and integrity is easiest to maintain with fixed columns, since you'll have the full range of constraints at your disposal. If you have a lot of subcategory columns that would otherwise clutter your regular category table, I'd recommend this with fixed columns. As with the previous option, I'd never recommend going dynamic for anything but throwaway data.
Alternatively, if your subcategory data is very variable and volatile: use NoSQL with a document database such as MongoDB. Note that you can keep all your regular data in a proper RDBMS and store just the side-data in the document database, though that's probably not recommended.
If your subcategory data is in a known fixed state and not prone to change, I'd just add the extra columns to the specific category table. Keep in mind that the major feature of a proper DBMS is safeguarding the integrity of your data via checks and constraints; doing away with that is never really a good idea.
If you are not limited to MySQL, you can consider Microsoft SQL Server and its sparse columns. These allow you to expand your schema to include however many columns you want, without incurring the storage penalty for columns that are not pertinent to a given row.

How to deal with duplicates in database?

In a program, should we use try catch to check insertion of duplicate values into tables, or should we check if the value is already present in the table and avoid insertion?
This is easy enough to enforce with a UNIQUE constraint on the database side, so that's my recommendation. I try to put as much of the data integrity as possible into the database so that I can avoid bad data (although sometimes that's unavoidable).
If this is how you already have it, you might as well just catch the MySQL exception for a duplicate-value insertion on such a table, as doing the check and then the insertion is more costly than having the database do one simple lookup (and possibly an insert).
It depends on whether you are inserting one row or a million, as well as whether the duplicate is in the primary key.
If its the primary key, read: http://database-programmer.blogspot.com/2009/06/approaches-to-upsert.html
An UPSERT or ON DUPLICATE KEY... The idea behind an UPSERT is simple.
The client issues an INSERT command. If a row already exists with the
given primary key, then instead of throwing a key violation error, it
takes the non-key values and updates the row.
This is one of those strange (and very unusual) cases where MySQL
actually supports something you will not find in all of the other more
mature databases. So if you are using MySQL, you do not need to do
anything special to make an UPSERT. You just add the term "ON
DUPLICATE KEY UPDATE" to the INSERT statement:
If it's not the primary key, and you are inserting just one row, then you can still make sure this doesn't cause a failure.
For your actual question, I don't really like the idea of using try/catch for program flow, but really, you have to evaluate readability and user experience (in this case performance), and pick what you think is the best mix of the two.
You can add a UNIQUE constraint to your table. Something like:
CREATE TABLE IF NOT EXISTS login
(
loginid SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
loginname CHAR(20) NOT NULL,
UNIQUE (loginname)
);
This will ensure no two login names are the same.
You can also create a unique composite key:
ALTER TABLE `TableName` ADD UNIQUE KEY (KeyOne, KeyTwo, ...);
You just need to create a unique key on your table so that it will not permit the same value to be added again.
You should try inserting the value and catch the exception. In a busy system, if you check for the existence of a value, it might get inserted between the time you check and the time you insert it.
Let the database do its job: let the database check for the duplicate entry.
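A sketch against the login table above (the exact error text varies by MySQL version):
INSERT INTO login (loginname) VALUES ('alice'); -- succeeds
INSERT INTO login (loginname) VALUES ('alice'); -- rejected:
-- ERROR 1062 (23000): Duplicate entry 'alice' for key 'loginname'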
A database is a computerized representation of a set of business rules, and a DBMS is used to enforce those business rules as constraints. Neither can verify that a proposition in the database is true in the real world. For example, if the model in question is the employees of an enterprise and the Employees table contains two people named 'Jimmy Barnes', neither the DBMS nor the database can know whether one is a duplicate, whether either is a real person, etc. A trusted source is required to determine existence and identity. In the above example, the enterprise's personnel department is responsible for checking public records, perusing references, ensuring the person is not already on the payroll, etc., then allocating a unique employee reference number that can be used as a key. This is why we look for industry-standard identifiers with a trusted source: ISBN for books, VIN for cars, ISO 4217 for currencies, ISO 3166 for countries, etc.
I think it is better to check whether the value already exists and avoid the insertion. The check for duplicate values can be done in the procedure that saves the data (using EXISTS if your database is a SQL database).
If a duplicate exists you avoid the insertion and can return a value to your app indicating so and then show a message accordingly.
For example, a piece of SQL code could be something like this:
SELECT @ret_val = 0
IF EXISTS (SELECT * FROM employee WHERE last_name = @param_ln AND first_name = @param_fn)
    SELECT @ret_val = -1
ELSE
    -- your insert statement here
SELECT @ret_val
Your condition for duplicate values will depend on what you define as a duplicate record. In your application you would use the return value to know if the data was a duplicate. Good luck!