I need to create a water tourism portal and I am wondering: is it possible to store a set of enumerated values in a table?
For example, a track can allow different types of boats: kayak, boat, canoe. The person who creates a track can choose whether the track is valid for only one of those types, for two, or for all three. How can I store this data? I am thinking about an enum, but I am not sure whether I can store such data in a table.
While there is an ENUM type, I generally recommend against using it. It has some unconventional behavior at times (you can reference values by index, and the data type is not handled well by many APIs), and modifying the list of values requires altering the table structure (which requires rebuilding the table, data and all, behind the scenes).
You are much better off creating a lookup table with the enum's int value as an id and a string for the value's name. Your "tracks" table can just reference that, as can whatever interface you provide for users to select a boat type. Using an ENUM would mean you either have to keep the boat types embedded in code behind the user interface, which you then have to coordinate with the enum values in the table definition; or you have to query the table structure and parse the data type of the "boat type" field.
Note: if different types need different handling, it can be very helpful to have a code enum mirror such a lookup table (or rather, have the lookup table reflect a code enum). The lookup table then mainly serves to enforce data integrity on the database side and to help display the data in a user-intelligible way.
Also, keeping future expansion in mind: if the tourism portal later decides to start facilitating rentals, the boats that can be rented will likely have types too, so you would either have to duplicate the ENUM or could simply reference the same lookup table.
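A minimal sketch of that design (assuming MySQL; the table and column names, and an existing "track" table keyed on track_id, are illustrative):

CREATE TABLE boat_type (
    boat_type_id TINYINT UNSIGNED PRIMARY KEY,
    name         VARCHAR(30) NOT NULL UNIQUE
);

INSERT INTO boat_type (boat_type_id, name)
VALUES (1, 'kayak'), (2, 'boat'), (3, 'canoe');

-- A track can allow one, two, or all three types, so use a junction table
-- rather than a single column on the track table.
CREATE TABLE track_boat_type (
    track_id     INT NOT NULL,
    boat_type_id TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (track_id, boat_type_id),
    FOREIGN KEY (track_id)     REFERENCES track (track_id),
    FOREIGN KEY (boat_type_id) REFERENCES boat_type (boat_type_id)
);

Adding a fourth boat type is then just an INSERT, with no table alteration.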
The functionality you're looking for is provided by the SET data type, which lets you assign to a column zero or more elements from a given set of (no more than 64) elements (see documentation).
Recommendations from Uueerdo still apply, of course.
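For example, a minimal sketch (MySQL; the table and column names are illustrative):

CREATE TABLE track (
    track_id   INT AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    boat_types SET('kayak', 'boat', 'canoe') NOT NULL
);

-- A track valid for kayaks and canoes only:
INSERT INTO track (name, boat_types) VALUES ('River loop', 'kayak,canoe');

-- All tracks that allow canoes:
SELECT track_id, name
FROM track
WHERE FIND_IN_SET('canoe', boat_types) > 0;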
I can't find a term for what I'm trying to do, so that may be limiting my ability to find info related to my question.
I'm trying to relate product identifiers and product processing codes (orange table in fig.) with validation against what product types and subtypes are valid for each process code based on process type. Importantly, each product identifier is related to a product type (see ProductIdentifier table) and each process code is related to process type (see ProcessCode table). I minimized the attributes in the tables below to only those necessary for my question.
In the above example, when I INSERT INTO the RunProcessTypeOne table, I need to validate that the ProductCode for RoleOneProductIdentifier is present in ProductTypeTwo. Similarly, I need to validate that the ProductCode for RoleTwoProductIdentifier is present in ProductSubtypeOne.
Of course I can use a stored procedure that inserts into the RunProcessTypeOne table after running SELECTs to check for the presence of the ProductCode related to RoleOneProductIdentifier and RoleTwoProductIdentifier in the relevant tables. This doesn't seem optimal, since I'm having to run three SELECTs for every INSERT. Plus, it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign keys).
Are there alternatives to this approach? Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
If more details are helpful: the relationship between ProductCode and ProcessCode is many-to-many, but there are rules that define product roles in each process, and only certain product types or subtypes may fulfill those roles. ProductTypeOne might include attributes that define a specific kind of product, like color or shape. ProductIdentifier includes the many lots of any ProductCode that are manufactured. ProcessCode includes settings that are put on a machine for processing. ProductType, by way of ProductCode, determines whether a ProductIdentifier is valid for a particular ProcessType. Individual ProcessCodes don't discriminate valid ProductIdentifiers; only the ProcessType related to the ProcessCode would discriminate.
it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign key).
Yes, that's an important observation; good to see you questioning the current schema. The fact of the matter is that SQL is not very powerful when it comes to representing data structures, so a stored procedure is often the only (or least-bad) approach.
I'll make a suggestion for how to achieve this without stored procedures, but I won't call it "optimal": there's likely to be a performance hit for INSERTs (and worse for UPDATEs), because the SQL engine will probably be in effect carrying out the same SELECTs as you'd code in a stored procedure.
Split table ProductIdentifier into two:
ProductIdentifierTypeTwo PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeTwo.ProductCode.
ProductIdentifierTypeOne PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeOne.ProductCode.
Also CREATE VIEW ProductIdentifier as the UNION of the two sub-tables, with PK ProductIdentifier. This makes sure a ProductIdentifier isn't duplicated between the two types.
IOW this avoids the ProductIdentifier table directly referencing the ProductCode table, where it can only examine ProductType as a column value, not as a referential structure.
Then
RunProcessTypeOne.RoleOneProductIdentifier FK REFERENCES ProductIdentifierTypeTwo.ProductIdentifier.
RunProcessTypeOne.RoleTwoProductIdentifier FK REFERENCES ProductIdentifierTypeOne.ProductIdentifier.
Making the original ProductIdentifier a VIEW is the least-bad way to manage updates (I'm guessing from your comment): ProductIdentifiers are less volatile than RunProcesses.
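In rough DDL, the whole suggestion looks like this (a sketch only: the INT types, the RunId key, and the assumption that ProductTypeOne/ProductTypeTwo are keyed on ProductCode are mine, not from your schema):

CREATE TABLE ProductIdentifierTypeTwo (
    ProductIdentifier INT PRIMARY KEY,
    ProductCode       INT NOT NULL,
    FOREIGN KEY (ProductCode) REFERENCES ProductTypeTwo (ProductCode)
);

CREATE TABLE ProductIdentifierTypeOne (
    ProductIdentifier INT PRIMARY KEY,
    ProductCode       INT NOT NULL,
    FOREIGN KEY (ProductCode) REFERENCES ProductTypeOne (ProductCode)
);

CREATE VIEW ProductIdentifier AS
    SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeTwo
    UNION ALL
    SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeOne;

CREATE TABLE RunProcessTypeOne (
    RunId                    INT PRIMARY KEY,
    RoleOneProductIdentifier INT NOT NULL,
    RoleTwoProductIdentifier INT NOT NULL,
    FOREIGN KEY (RoleOneProductIdentifier)
        REFERENCES ProductIdentifierTypeTwo (ProductIdentifier),
    FOREIGN KEY (RoleTwoProductIdentifier)
        REFERENCES ProductIdentifierTypeOne (ProductIdentifier)
);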
Re your more general question:
Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
There are facilities included in the SQL standard. Most vendors haven't implemented them, or only partially support them -- essentially because implementing them would need running SELECTs with tricky logic as part of table updates.
You should be able to CREATE VIEW with a filter to only the rows that are the target of some FK.
(Your dba is likely to object that VIEWs come with an unacceptable performance hit. In this example, you'd have a single ProductIdentifier table, with the two sub-tables I suggest above as VIEWs. But maintaining those views would need joining to ProductCode to filter by ProductType.)
Then you should be able to define a FK to the VIEW rather than to the base table.
(This is the bit many SQL vendors don't support.)
Often I find myself creating 'status' fields for database tables. I set these up as TINYINT(1) since, more often than not, I only need a handful of status values. I cross-reference these values to array lookups in my code; an example is as follows:
0 - Pending
1 - Active
2 - Denied
3 - On Hold
This all works very well, except I'm now trying to create better database structures and realise that from a database point of view, these integer values don't actually mean anything.
Now, a solution to this may be to create separate tables for statuses, but there could be several status columns across the database, and having a separate table for each status column seems like overkill. (I'd like each status to start from zero, so having one status table for all statuses wouldn't be ideal for me.)
Another option is to use the ENUM data type, but opinions on it are mixed; many people recommend against using ENUM fields.
So what would be the way to go? Do I absolutely need to be putting this data in to its own table?
I think the best approach is to have a single status table for each kind of status. For example, order_status ("placed", "paid", "processing", "completed") is qualitatively different from contact_status ("received", "replied", "resolved"), but the latter might work just as well for customer contacts as for supplier contacts.
This is probably already what you're doing — it's just that your "tables" are in-memory arrays rather than database tables.
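A minimal sketch of what that looks like once the arrays move into the database (MySQL-flavored; the names are illustrative):

CREATE TABLE order_status (
    order_status_id TINYINT PRIMARY KEY,
    name            VARCHAR(20) NOT NULL UNIQUE
);

INSERT INTO order_status (order_status_id, name)
VALUES (0, 'placed'), (1, 'paid'), (2, 'processing'), (3, 'completed');

CREATE TABLE orders (
    order_id        INT AUTO_INCREMENT PRIMARY KEY,
    order_status_id TINYINT NOT NULL,
    -- ...other order columns...
    FOREIGN KEY (order_status_id) REFERENCES order_status (order_status_id)
);

Since each kind of status gets its own table, each one can start from zero independently.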
I really agree with "ruakh" on creating another table structured as (id, statusName). However, I would like to add that for such a table you can still use TINYINT(1) for the id field: a signed TINYINT accepts values from -128 to 127 (0 to 255 if unsigned), which should cover all the status cases you might need.
Can you add (or remove) a status value without changing code?
If yes, then consider a separate lookup table for each status "type". You are already treating this data in a generic way in your code, so you should have a generic data structure for it.
If no, then keep the ENUM (or well-documented integer). You are treating each value in a special way, so there isn't much purpose in trying to generalize the data model.
(I'd like each status to start from zero - so having one status table for all statuses wouldn't be ideal for me)
You should never mix several distinct sets of values within the same lookup table (regardless of your "zero issue"). Reasons:
A simple FOREIGN KEY alone won't be able to prevent referencing a value from the wrong set.
All values are forced into the same type, which may not always be desirable.
That's such a common anti-pattern that it even has a name: "one true lookup table".
Instead, keep each lookup "type" within a separate table. That way, FKs work predictably and you can tweak datatypes as necessary.
I have a base entity (items) that will host a vast range of item types (>200) with totally different properties. I want a clean, portable, and fast solution, and have come up with an idea that maybe has a name I'm unaware of.
Here it goes:
The items entity holds the base-class fields, plus additional fields for the subclass fields under dummy names: ItemID, ItemNo, ItemTypeID, int1, int2, dec1, dec2, dec3, str1, str2.
The referenced itemtype record holds the name of the type and a child entity (1:n):
itemtypefields [itemtypeid,name,type,realfield]
example row: [53, MaxPressure, dec, dec3]
Its limitations:
hard to estimate field requirements in the base class
harder to add domains/check constraints based on the child type
need an application layer to translate tagged SQL into the real query
only possible to query one type at a time, since shared attributes may be mapped to different "real" fields
The 3rd bullet explained:
select ItemNo, _MaxPressure_ from items where ItemTypeID = 10 and _MaxPressure_ > 42
should translate to:
select ItemNo, dec3 as MaxPressure from items where ItemTypeID = 10 and dec3 > 42
(can't do that with SPs or UDFs, right - or would it be possible?)
But benefits of:
Performance
Ease of CRUD-operations
Easier to sort/filter at application level.
Now - does it have a name?
This antipattern is called One True Lookup Table.
In a relational database, each column needs to be defined as one logical type. I don't mean one SQL data type like INT or VARCHAR, I mean everything in that column from start to finish must be from the same set of values, and you should be able to tell one value apart from another value.
You can't put shoe size and average temperature and threads per inch into the same column of a given table, and still call it a relation.
Basically, your database would not be a database at all -- it would be a spreadsheet.
Read almost any book by C. J. Date, such as SQL and Relational Theory for a proper explanation of relations and types.
Re your comment:
Read the Q again before lecturing about elementary books and mocking about semi-structured data.
Okay, I have re-read your post.
The classic use of One True Lookup Table isn't exactly what you're doing, but what you're doing shares the same problems with OTLT.
Suppose you have "MaxPressure" stored in column dec3 for ItemType 10. Suppose there are a fixed set of valid choices for the value of MaxPressure, and you want to put those in another lookup table, so that no one can enter an invalid MaxPressure value.
Now: declare a foreign key constraint on dec3 referencing your MaxPressures lookup table. You can't -- the problem is that the foreign key constraint applies to the dec3 column in all rows, not just those rows where ItemType is 10.
The reason is that you're storing more than one set of values in a single column. The same problem arises for any other kind of constraint -- unique constraints, check constraints, even NOT NULL. And you can't declare a DEFAULT value for the column either, because you probably have a different correct default for each ItemType (and some ItemTypes have no default for that attribute).
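To see the problem concretely, here is a hypothetical attempt (the max_pressures lookup table and the DECIMAL type are invented for illustration):

CREATE TABLE max_pressures (
    pressure_value DECIMAL(10,2) PRIMARY KEY
);

ALTER TABLE items
    ADD CONSTRAINT fk_maxpressure
    FOREIGN KEY (dec3) REFERENCES max_pressures (pressure_value);

-- The constraint now applies to dec3 in EVERY row, including rows where
-- ItemTypeID <> 10 and dec3 holds threads per inch or something else entirely.
-- There is no way to declare "enforce this FK only where ItemTypeID = 10".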
The reason that I referred to the C. J. Date book is that he gives a crisp definition for a type: it's a named finite set, over which the equality operation is defined. That is, you can tell if the value "42" on one row is the same as the value "42" on another row. In a relational column, that must be true because they must come from the same original set of values. In your table, dec3 could have the value "42" when it's MaxPressure, but "42" for another ItemType when it's threads per inch. Therefore they aren't the same value "42". If you had a unique constraint, these two 42's would not be considered duplicates. If you had a foreign key, each of the different 42's would reference a different lookup table, etc.
What you're doing is not a valid relational database design.
Don't bristle at my referring you to a resource on relational database design unless you understand that.
I have a number of different objects with a varying number of attributes. Until now I have saved the data in XML files which easily allow for an ever changing number of attributes. But I am trying to move it to a database.
What would be your preferred way to store this data?
A few strategies I have identified so far:
Having one single field named "attributes" in the object's table and store the data serialized or json'ed in there.
Storing the data in two tables (objects, attributes) and using a third to save the relations, making it a true n:m relation. Very clean solution, but possibly very expensive to fetch an entire object and all its attributes
Identifying attributes all objects have in common and creating fields for these to the object's table. Store the remaining attributes as serialized data in another field. This has an advantage over the first strategy, making searches easier.
Any ideas?
If you ever plan on searching for specific attributes, it's a bad idea to serialize them into a single column, since you'll have to use per-row functions to get the information out - this rarely scales well.
I would opt for your second choice. Have a list of attributes in an attributes table, the objects in their own table, and a many-to-many relationship table called object_attributes.
For example:
CREATE TABLE objects (
    object_id   INTEGER,
    object_name VARCHAR(20),
    PRIMARY KEY (object_id)
);

CREATE TABLE attributes (
    attr_id   INTEGER,
    attr_name VARCHAR(20),
    PRIMARY KEY (attr_id)
);

CREATE TABLE object_attributes (
    object_id INTEGER,
    attr_id   INTEGER,
    oa_value  VARCHAR(20),
    PRIMARY KEY (object_id, attr_id),
    FOREIGN KEY (object_id) REFERENCES objects (object_id),
    FOREIGN KEY (attr_id)   REFERENCES attributes (attr_id)
);
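Fetching an object with all of its attributes is then a single join, returning one row per attribute (a sketch):

SELECT o.object_name, a.attr_name, oa.oa_value
FROM objects o
JOIN object_attributes oa ON oa.object_id = o.object_id
JOIN attributes a ON a.attr_id = oa.attr_id
WHERE o.object_id = 1;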
Your concern about performance is noted but, in my experience, it's always more costly to split a column than to combine multiple columns. If it turns out that there are performance problems, it's perfectly acceptable to break 3NF for performance reasons.
In that case I would store it the same way but also have a column with the raw serialized data. Provided you use insert/update triggers to keep the columnar and combined data in sync, you won't have any problems. But you shouldn't worry about that until an actual problem surfaces.
By using those triggers, you minimize the work required to only when the data changes. By trying to extract sub-column information, you do unnecessary work on every select.
A variation on your 2nd solution is just two tables (assuming all attributes are of a single type):
T1: |Object data columns|Object_id|
T2: |Object id|attribute_name|attribute value| (unique index on first 2 columns)
This is even more efficient when combined with 3rd solution, e.g. all of the common fields go into T1.
Stuffing more than one attribute into the same blob is not recommended: you cannot filter by attributes, and you cannot update them efficiently.
Let me give some concreteness to what DVK was saying.
Assuming values are of the same type, the table would look like this (good luck, I feel you're going to need it):
dynamic_attribute_table
------------------------
id NUMBER
key VARCHAR
value SOMETYPE?
example (cars):
|id| key | value |
---------------------------
| 1|'Make' |'Ford' |
| 1|'Model' |'Edge' |
| 1|'Color' |'Blue' |
| 2|'Make' |'Chevrolet'|
| 2|'Model' |'Malibu' |
| 2|'MaxSpeed'|'110mph' |
Thus,
entity 1 = { ('Make', 'Ford'), ('Model', 'Edge'), ('Color', 'Blue') }
and,
entity 2 = { ('Make', 'Chevrolet'), ('Model', 'Malibu'), ('MaxSpeed', '110mph') }.
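If you ever need an entity back as a single row, the usual (clunky) pivot looks like this sketch (MySQL-flavored; the backticks are needed because key is a reserved word):

SELECT id,
       MAX(CASE WHEN `key` = 'Make'  THEN value END) AS Make,
       MAX(CASE WHEN `key` = 'Model' THEN value END) AS Model,
       MAX(CASE WHEN `key` = 'Color' THEN value END) AS Color
FROM dynamic_attribute_table
GROUP BY id;

One CASE per attribute: that is part of the price of the key/value design.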
If you are using a relational db, then I think you did a good job listing the options. They each have their pros and cons. YOU are in the best position to decide what works best for your circumstances.
The serialized approach is probably the fastest (depending on your code for de-serializing), but it means that you won't be able to query the data with SQL. If you say that you don't need to query the data with SQL, then I agree with #longneck, maybe you should use a key/value style db instead of a relational db.
EDIT - reading more of your comments: WHY are you switching to a db if speed is your main concern? What's BAD about your current XML implementation?
I used to implement this scheme:
t_class (id RAW(16), parent RAW(16)) -- holds class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- holds GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties
t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.
RAW(16) is where Oracle holds GUIDs
If you want to select all properties for an object, you issue:
SELECT  i.*
FROM    (
        SELECT  id
        FROM    t_class
        START WITH
                id = (SELECT class FROM t_declaration WHERE id = :object_id)
        CONNECT BY
                parent = PRIOR id
        ) c
JOIN    t_property p
ON      p.class = c.id
LEFT JOIN
        t_instance i
ON      i.id = :object_id
        AND i.class = p.class
        AND i.property = p.property
t_property holds stuff you normally don't search on (like text descriptions etc.)
Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.
You don't have to use fast tables and limit all your data to these four tables.
Sounds like you need something like CouchDB, not an RDBMS.
If you are going to edit/manipulate/delete the attributes at a later point, a true n:m relation (the second option) is the one I would go for. (Or try to make it two tables, where the same attribute repeats; but the data size will be high.)
If you are not manipulating the attributes (just capturing and showing the data), then you can go ahead and store them in one field with some separator (make sure the separator cannot occur in the attribute values).
I am assuming you do not have a digital attribute soup, but that there is some order to your data.
Otherwise, an RDBMS might not be the best fit; something along the lines of NoSQL might work better.
If your objects are of different types, you should generally have one table per type.
Especially if you want to connect them using primary keys. It also helps to bring order and sanity if you have Products, Orders, Customers, etc. tables, instead of just Object and Attribute tables.
Then look at your attributes. Any attribute that exists for more than, say, 50% of the objects in that type category should be a column in the object's table, with NULL when it's not being used.
Anything that is mandatory, should, of course, be defined as a NOT NULL column.
The rest, you can either have one or several "extra attributes" tables for.
You could put the attribute names into the table with the values, or normalize them out in a separate table and only use the primary key in the value table.
You may also find that you have combinations of data. For instance, a variant of an object type always has a certain set of attributes while another variant of the same object type has another set of attributes.
In that case, you might want to do something like:
CREATE TABLE MainObjectTable (
    mainObjectId INT PRIMARY KEY
    -- ...common columns...
);

CREATE TABLE MainObjectVariant1Table (
    mainObjectId INT PRIMARY KEY,
    -- ...variant-1 columns...
    FOREIGN KEY (mainObjectId) REFERENCES MainObjectTable (mainObjectId)
);

CREATE TABLE MainObjectVariant2Table (
    mainObjectId INT PRIMARY KEY,
    -- ...variant-2 columns...
    FOREIGN KEY (mainObjectId) REFERENCES MainObjectTable (mainObjectId)
);
I think the hard work, that will pay off, in the long run, is to analyze the data, find the objects and the commonly used attributes and make it into a good "object/ERD/DB" model.
We are currently thinking about different ways to implement custom fields for our web application. Users should be able to define custom fields for certain entities and fill in/view this data (and possibly query the data later on).
I understand that there are different ways to implement custom fields (e.g. using a name/value table or using alter table etc.) and we are currently favoring using ALTER TABLE to dynamically add new user fields to the database.
After browsing through other related SO topics, I couldn't find any big drawbacks of this solution. In contrast, having the option to query the data in fast way (e.g. by directly using SQL's where statement) is a big advantage for us.
Are there any drawbacks you could think of by implementing custom fields this way? We are talking about a web application that is used by up to 100 users at the same time (not concurrent requests..) and can use both MySQL and MS SQL Server databases.
Just as an update, we decided to add new columns via ALTER TABLE to the existing database table to implement custom fields. After some research and tests, this looks like the best solution for most database engines. A separate table with meta information about the custom fields provides the needed information to manage, query and work with the custom fields.
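A sketch of that arrangement (MySQL-flavored; all names here are illustrative):

-- a user adds a "warranty months" custom field to the products table:
ALTER TABLE products ADD COLUMN custom_warranty_months INT NULL;

-- meta table the application uses to manage and query the custom fields:
CREATE TABLE custom_field (
    custom_field_id INT AUTO_INCREMENT PRIMARY KEY,
    entity_table    VARCHAR(64)  NOT NULL,  -- e.g. 'products'
    column_name     VARCHAR(64)  NOT NULL,  -- e.g. 'custom_warranty_months'
    display_label   VARCHAR(100) NOT NULL,  -- e.g. 'Warranty (months)'
    data_type       VARCHAR(20)  NOT NULL   -- e.g. 'INT'
);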
The first drawback I see is that you need to grant your application service ALTER rights.
This implies that your security model needs careful attention, as the application will be able not only to add fields but also to drop and rename them, and to create tables (at least in MySQL).
Secondly, how would you distinguish fields that are required per user? Or can the fields created by user A be accessed by user B?
Note that the number of columns may also grow significantly. If every user adds 2 fields, we are already talking about 200 fields.
Personally, I would use one of the two approaches or a mix of them:
Using a serialized field
I would add one text field to the table in which I would store a serialized dictionary or dictionaries:
{
    user_1: {key1: val1, key2: val2, ...},
    user_2: {key1: val1, key2: val2, ...},
    ...
}
The drawback is that the values are not easily searchable.
Using a multi-type name/value table
fields table:

CREATE TABLE fields (
    field_id   INT PRIMARY KEY,  -- added: value rows need something to reference
    user_id    INT NOT NULL,
    field_name VARCHAR(100) NOT NULL,
    type       ENUM('INT', 'REAL', 'STRING') NOT NULL
);

values table (called field_values here, since VALUES is a reserved word):

CREATE TABLE field_values (
    field_id    INT NOT NULL,
    row_id      INT NOT NULL,  -- the main table row id
    int_value   INT,
    float_value FLOAT,
    text_value  TEXT,
    PRIMARY KEY (field_id, row_id),
    FOREIGN KEY (field_id) REFERENCES fields (field_id)
);
Of course, it requires a join and is a bit more complicated to implement but far more generic and, if indexed properly, quite efficient.
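Querying then picks the typed column according to the field's declared type; a sketch building on the tables above:

SELECT v.row_id,
       CASE f.type
           WHEN 'INT'    THEN CAST(v.int_value AS CHAR)
           WHEN 'REAL'   THEN CAST(v.float_value AS CHAR)
           WHEN 'STRING' THEN v.text_value
       END AS field_value
FROM fields f
JOIN field_values v ON v.field_id = f.field_id
WHERE f.field_name = 'max_speed';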
I see nothing wrong with adding new custom fields to the database table.
With this approach, the specific, most appropriate type can be used for each field; i.e. need an int field? Define it as INT. With a name/value table, by contrast, you'd be storing multiple data types as one type (probably nvarchar), unless you give that name/value table multiple columns of different types and populate the appropriate one, which is a bit horrible.
Also, adding new columns makes it easier to query/no need to involve a join to a new name/value table.
It may not feel as generic, but I feel that's better than having a "one-size fits all" name/value table.
From an SQL Server point of view (2005 onwards)....
An alternative would be to create one "custom data" field of type XML. This would be truly generic and require no field creation or a separate name/value table. It also has the benefit that not all records have to have the same custom data (i.e. the one field is common, but what it contains doesn't have to be). I'm not 100% sure of the performance impact, but XML data can be indexed.
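A sketch of that approach (SQL Server; the table, element names, and query are illustrative):

CREATE TABLE dbo.Customer (
    CustomerId INT IDENTITY PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL,
    CustomData XML NULL  -- free-form, per-row custom attributes
);

-- XML columns can be indexed:
CREATE PRIMARY XML INDEX PXML_Customer_CustomData
    ON dbo.Customer (CustomData);

-- Pull a custom attribute out of the XML:
SELECT CustomerId,
       CustomData.value('(/attrs/loyaltyLevel)[1]', 'int') AS LoyaltyLevel
FROM dbo.Customer
WHERE CustomData.exist('/attrs/loyaltyLevel[. = 3]') = 1;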