I have a table with lots of redundant data.
I'd like to change some of the VARCHAR values to some sort of "AUTO_LOOKUP" data type that automatically maintains and resolves values from a look-up table.
MySQL does this partially with the ENUM datatype, but it requires ahead-of-time definition of all known values. I would like the list of values to dynamically grow.
Does this exist?
Related questions:
Similar concept, using a custom datatype in Derby: User-defined types in Apache Derby as ENUM replacements
Similar concept, but handling 100% in the client application (I want it handled in the database instead): Ways to save enums in database
Yes, it's called a foreign key, and it's supported by virtually all relational databases.
You define your set of varchars in a lookup table, with one row per value. Associated with the string is a more compact primary key, typically an auto-increment integer.
Then in your large table, just reference the entry in the lookup table by integer.
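A minimal sketch of this in MySQL (table and column names are illustrative):

    -- Lookup table: one row per distinct string value.
    CREATE TABLE color (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(50) NOT NULL UNIQUE
    );

    -- The large table references the lookup table by integer.
    CREATE TABLE product (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        color_id INT UNSIGNED NOT NULL,
        FOREIGN KEY (color_id) REFERENCES color (id)
    );

    -- Inserting a value: add it to the lookup table if it's new,
    -- then reference it. The LAST_INSERT_ID(expr) idiom makes
    -- LAST_INSERT_ID() return the existing row's id on duplicates,
    -- so the list of values grows dynamically.
    INSERT INTO color (name) VALUES ('teal')
        ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
    INSERT INTO product (color_id) VALUES (LAST_INSERT_ID());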
Using an object-oriented programming language, I'm in a situation where I have an object where one of the properties is the type of the object. When I save it to the database I always create two tables: one for the object itself, and one for the object type (id, name). Then I assign the typeId to the object and create a foreign key to the ObjectType table. I'm not sure this is correct because:
I'm using a whole table to save only a few records (5-10 possible object types) and they will rarely be updated.
I need to do a JOIN between two tables to show the name of the object type.
In the code I declare constants with the same IDs as in the type table (to use when programming and assigning the type to an object), and I don't like this redundancy.
The other option is to use a string and save it directly in the object's table, but this doesn't sound good either: searching will be slower than with typeId, there is no list of all the possible types, and renaming a type is more complicated. Can you advise me on the best thing to do in this situation?
In this case, you have 3 options:
Varchar
Enum Fields
Another joined table (a lookup table).
And you have 5 evaluation parameters:
Redundancy
Extendability
Modifiability
Performance
Simplicity
Varchar:
Redundancy: Very bad (you have to copy the value into every row)
Extendability: Good (you can add new types easily)
Modifiability: Very bad (you have to update every inserted row with the same value)
Performance: Excellent (see the references below)
Simplicity: Excellent (easy to use like any other field, and in ORMs)
Enum fields:
Redundancy: Good (the DBMS controls the redundancy)
Extendability: Very, very bad (see the references below)
Modifiability: Very, very bad (see the references below)
Performance: Excellent (see the references below)
Simplicity: Very good (some ORMs treat it as a string field; see the reference below)
Another joined table (lookup table):
Redundancy: Excellent (the standard ER approach)
Extendability: Excellent (the standard ER approach)
Modifiability: Excellent (the standard ER approach)
Performance: Normal (maybe bad, if speed is critical)
Simplicity: Normal (depends on the programmer)
References:
See the performance analysis here: Enum Fields VS Varchar VS Int + Joined table: What is Faster?
See ENUM advantages and disadvantages here: 8 Reasons Why MySQL's ENUM Data Type Is Evil.
See a lookup table and VARCHAR example in MySQL here on SO.
Finally: based on your evaluation parameters, you can choose the proper option.
The other possibility is to use just an integer in the database and manage everything in source code. I do not include it in the evaluation because it is not database design; it is a programming approach.
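To make the comparison concrete, here is a minimal sketch of the three options in MySQL (table, column, and type names are illustrative):

    -- Option 1: VARCHAR, the value stored directly in each row.
    CREATE TABLE obj_varchar (
        id INT AUTO_INCREMENT PRIMARY KEY,
        type_name VARCHAR(30) NOT NULL
    );

    -- Option 2: ENUM, the set of values fixed in the column definition.
    CREATE TABLE obj_enum (
        id INT AUTO_INCREMENT PRIMARY KEY,
        type_name ENUM('physical', 'digital', 'service') NOT NULL
    );

    -- Option 3: a lookup table with a foreign key.
    CREATE TABLE object_type (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(30) NOT NULL UNIQUE
    );
    CREATE TABLE obj (
        id INT AUTO_INCREMENT PRIMARY KEY,
        type_id INT NOT NULL,
        FOREIGN KEY (type_id) REFERENCES object_type (id)
    );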
I use an auto-increment PK on tables where I store, for example, posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what the best (most common) type to use for this field would be.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key instead, for performance.
What is your take on the best approach for this situation?
MySQL has a function that generates a 128-bit UUID (version 1, as described in RFC 4122) and returns it as a hex string with dashes, following the customary UUID formatting.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits are seldom changing, and the lower bits are simply monotonically incrementing, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
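A minimal sketch of how that could look (table and column names are illustrative):

    -- Internal auto-increment PK plus a public 64-bit identifier.
    CREATE TABLE post (
        id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        public_id BIGINT UNSIGNED NOT NULL UNIQUE,
        title VARCHAR(255) NOT NULL
    );

    -- UUID_SHORT() can't be a column DEFAULT in MySQL 5.7,
    -- so generate the value in the INSERT itself.
    INSERT INTO post (public_id, title)
    VALUES (UUID_SHORT(), 'My first post');

    -- Clients look up posts by public_id only; id stays internal.
    SELECT id, title FROM post WHERE public_id = 12345678901234567;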
I'm working on a website that should be multilingual, and some products may have more fields than others (for example, a future product might have an extra feature that old products don't). Because of this, I decided to have a product table with common fields that all products share and that are the same in all languages (like width and height), and to add another three tables for storing the extra fields, as below:
field (id,name)
field_name(field_id,lang_id,name)
field_value(product_id, field_id, lang_id, value)
By doing this I can fetch all the values from one table, but the problem is that the values can be of different types: for example, a value could be a number or a text. I checked the open-source project Drupal, and there they create a table for each field type and retrieve a node's data with joins. I want to know which will impact performance more: having a table for each extra field type, or storing all the values in one table and converting their types on the fly by casting?
Thank you in advance.
Yes, but no. You are storing your data in an entity-attribute-value form (EAV). This is rather inefficient in general. Here are some issues:
As you have written it, you cannot do type checking.
You cannot set-up foreign key relationships in the database.
Fetching the results for a single row requires multiple joins or a group by.
You cannot write indexes on a specific column to speed access.
There are some work-arounds. You can get around the typing issue by having separate columns for different types. So, the data structure would have:
Name
Type
ValueString
ValueInt
ValueDecimal
Or whatever types you want to support.
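A minimal sketch of that structure, adapted to the field_value table from the question (column names are illustrative):

    -- One nullable value column per supported type; the field's
    -- declared type determines which column is populated.
    CREATE TABLE field_value (
        product_id INT NOT NULL,
        field_id INT NOT NULL,
        lang_id INT NOT NULL,
        value_string VARCHAR(255) NULL,
        value_int INT NULL,
        value_decimal DECIMAL(18,4) NULL,
        PRIMARY KEY (product_id, field_id, lang_id)
    );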
There are some other "tricks" if you want to go this route. The most important is to decimal-align the numbers: instead of storing '1' and '10', you would store ' 1' and '10', so that string ordering matches numeric ordering.
When faced with such a problem, I often advocate a hybrid approach. This approach would have a fixed record with the important properties all nicely located in columns with appropriate types and indexes -- columns such as:
ProductReleaseDate
ProductDescription
ProductCode
And whatever values are most useful. An EAV table can then be used for additional properties that are optional. This generally balances the power of the relational database to handle structured data along with the flexibility of an EAV approach to support variable columns.
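A hybrid sketch along those lines (column names are illustrative):

    -- Common, frequently queried attributes as real typed columns;
    -- optional extras stay in the EAV table sketched above.
    CREATE TABLE product (
        id INT AUTO_INCREMENT PRIMARY KEY,
        product_code VARCHAR(32) NOT NULL UNIQUE,
        product_release_date DATE NULL,
        product_description TEXT NULL
    );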
I am using SQL Server 2008 R2.
In a table, I have a column of type VARCHAR(50) in which I am storing some alphanumeric data. The data are of varying lengths, since they come from user input.
Now I have a requirement to store NEWID() - UNIQUEIDENTIFIER-type data - in the same column. But I have existing data that I cannot modify, so instead of converting the column's datatype from VARCHAR to UNIQUEIDENTIFIER, I am thinking about storing NEWID() in VARCHAR format in the same column.
Is it advisable to do this?
If the column you are speaking of is not to be used as a primary key or index, then you will not have any issues: you can store whatever you want there.
If the column is already an index or even the primary key, you will have the same issues you already have (per your post, the data is of variable length).
If you introduce an index or primary key when making this change, then you should take into accounts the possible performance impacts of using a variable length column for this.
A piece of good advice, if the above is inevitable, would be to reduce the size of the column to 36 characters: the standard length of the string representation of a GUID in most systems and frameworks, including SQL Server's UNIQUEIDENTIFIER type, which is rendered as five groups of hex digits (32 hex characters in total) separated by 4 hyphens.
If your current data fits safely in 36 characters, I recommend redefining the column before making it an index or PK, especially if your table already has many rows.
If possible, also redefine it as CHAR(36) (or even NCHAR(36), as advised in the comments): fixed-length columns perform better as indexes. Also, MS SQL Server 2005 and above support the NEWSEQUENTIALID() function for generating sequential IDs, which perform better in index or PK columns.
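A minimal T-SQL sketch of both points (table and column names are hypothetical):

    -- Store new GUIDs as fixed-length strings in the existing column.
    INSERT INTO dbo.Items (ItemKey)
    VALUES (CONVERT(CHAR(36), NEWID()));

    -- NEWSEQUENTIALID() is only allowed as a DEFAULT constraint
    -- on a UNIQUEIDENTIFIER column:
    CREATE TABLE dbo.NewItems (
        ItemKey UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_NewItems_ItemKey DEFAULT NEWSEQUENTIALID(),
        Name NVARCHAR(50) NOT NULL
    );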
I have inherited a LinqToSql application which is making use of GUID keys for objects.
I'd rather use conventional identity fields - much easier for people to use, understand, and communicate. However, there is some business logic that requires the application to identify unique objects before they're persisted to the DB, which is why GUIDs were used in the first place.
Another issue we're having is with fragmented indexes - AFAIK we can't create sequential GUIDs in .Net code.
As this is my first exercise in LinqToSql I'd like to know how others have addressed this issue.
BTW there is no need for the data between multiple servers to be combined - the main (only) reason that I've used GUID keys in the past.
No, you don't have to use Guids, you can use any key type you'd like.
If you are stuck with GUIDs, consider having the database generate them sequentially for you by making the default binding for the PK field newsequentialid(). This will at least eliminate fragmentation in your clustered index. You need to make a few modifications to the .dbml if you do this: on the key field in the .dbml, set Auto Generated Value = true and Auto-Sync = OnInsert.
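Setting that default binding might look like this in T-SQL (table and column names are hypothetical):

    -- Let SQL Server generate sequential GUIDs on insert;
    -- the Id column must be of type UNIQUEIDENTIFIER.
    ALTER TABLE dbo.Orders
        ADD CONSTRAINT DF_Orders_Id
        DEFAULT NEWSEQUENTIALID() FOR Id;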
As far as generating the value before you insert into the database, I don't see how using an identity field helps you. You will still have to insert into the database to reliably get the correct value. (Identity columns would have the same Auto Generated/Auto-Sync settings as above.)
Ints or GUIDs, you should be able to wrap the insert in a transaction: insert the record, grab the new key value, run your business logic, and if it fails, roll back the newly inserted record.
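A T-SQL sketch of that pattern (table and column names are hypothetical):

    BEGIN TRY
        BEGIN TRANSACTION;
        -- Insert and grab the new identity value.
        INSERT INTO dbo.Orders (CustomerName) VALUES ('Acme');
        DECLARE @NewId INT = SCOPE_IDENTITY();
        -- ... run the business logic that validates @NewId here ...
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- Roll back the newly inserted record on failure.
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
    END CATCH;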