The full requirements would take too long to describe, so I'll simplify the example.
I want to build a form creation system (the user can create a form, add fields, etc.). Let's focus on checkbox vs. textarea.
The checkbox can have a value of 0 or 1, depending on the checked status.
The textarea must be a LONGTEXT type.
So in the database, that gives me 3 choices for the structure of the field_value table (sketched as DDL after the list):
1.
checkbox_value (TINYINT) | textarea_value (MEDIUMTEXT)
That means no input will ever use all columns of the table, so the table wastes some space.
2.
allfield_value (MEDIUMTEXT)
That means that for the checkbox I'd store a really tiny value in a MEDIUMTEXT, which is wasteful.
3.
tblcheckbox.value
tbltextarea.value
Now I have one separate table per field type. That's optimal in terms of space, but in the whole context of the application I might have to read over 100 tables (one query with many JOINs) just to generate a single page that displays a form.
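Roughly, the three options as DDL (these are mutually exclusive alternatives; the names and suffixes are just for illustration):
-- Option 1: one wide table, one column per field type (most columns stay NULL)
CREATE TABLE field_value_v1 (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    field_id       INT NOT NULL,
    checkbox_value TINYINT NULL,
    textarea_value MEDIUMTEXT NULL
);
-- Option 2: one generic column, typed loosely enough to hold anything
CREATE TABLE field_value_v2 (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    field_id       INT NOT NULL,
    allfield_value MEDIUMTEXT NULL
);
-- Option 3: one table per field type
CREATE TABLE tblcheckbox (field_id INT NOT NULL, value TINYINT);
CREATE TABLE tbltextarea (field_id INT NOT NULL, value MEDIUMTEXT);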
In your opinion, what's the best way to proceed?
Do not consider an EAV data model. It's easy to put data in, but hard to get data out. It doesn't scale. It has no data integrity. You have to write lots of code yourself to do things that any RDBMS does for you if you model your data properly. Trying to use an RDBMS to create a general-purpose form management system that can accommodate any future needs is an example of the Inner-Platform Effect antipattern.
(By the way, if you do use EAV, don't try to join all the attributes back into a single row. You already commented that MySQL has a limit on the number of joins per query, but even if you can live within that, it doesn't perform well. Just fetch an attribute per row, and sort it out in application code. Loop over the attribute rows you fetch from the database, and populate your object field by field. That means more code for you to write, but that's the price of Inner-Platform Effect.)
If you want to store form data relationally, each attribute would go in its own column. This means you need to design a custom table for your form (or actually set of tables if your forms support multivalue fields). Name the columns according to the meaning of each given form field, not something generic like "checkbox_value". Choose a data type according to the needs of the given form field, not a one-size-fits-all MEDIUMTEXT or VARCHAR(255).
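For example, the two fields above might become properly named, properly typed columns on a table designed for one specific form (a hypothetical "contact" form; all names here are illustrative):
CREATE TABLE contact_form_response (
    response_id      INT AUTO_INCREMENT PRIMARY KEY,
    submitted_at     DATETIME NOT NULL,
    wants_newsletter TINYINT(1) NOT NULL DEFAULT 0,   -- the checkbox
    message          MEDIUMTEXT                       -- the textarea
);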
If you want to store form data non-relationally, you have more flexibility. You can use a non-relational document store such as MongoDB or even Solr. You can store documents without having to design a schema as you would with a relational database. But you lose many of the structural benefits that a schema gives you. You end up writing more code to "discover" the fields of documents instead of being able to infer the structure from the schema. You have no constraints or data types or referential integrity.
Also, you may already be using a relational database successfully for the rest of your data management and can't justify running two different databases simultaneously.
A compromise between relational and non-relational extremes is the Serialized LOB design, with the extension described in How FriendFeed Uses MySQL to Store Schema-Less Data. Most of your data resides in traditional relational tables. Your amorphous form data goes into a single BLOB column, in some format that encodes fields and data together (for example, XML or JSON or YAML). Then for any field of that data you want to be searchable, create an auxiliary table to index that single field and reference rows of form data where a given value in that respective field appears.
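A rough sketch of that pattern, assuming JSON inside the BLOB and a single searchable field ("email" is just an example; table names are placeholders):
-- Main table: relational keys plus one amorphous blob of form data
CREATE TABLE form_response (
    response_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    form_id     INT NOT NULL,
    body        BLOB NOT NULL              -- JSON/XML/YAML-encoded fields
);
-- Auxiliary table indexing the one field you need to search on
CREATE TABLE index_email (
    email       VARCHAR(255) NOT NULL,
    response_id BIGINT NOT NULL,
    PRIMARY KEY (email, response_id),
    FOREIGN KEY (response_id) REFERENCES form_response (response_id)
);
-- The application writes both rows; searching by email then becomes:
SELECT r.*
FROM index_email i
JOIN form_response r USING (response_id)
WHERE i.email = 'someone@example.com';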
You might want to consider an EAV data model.
Related
I'm planning on storing a large amount of data from a user-submitted form (around 100 questions) in a JSON field.
I will only need to query two pieces of data from the form: name and type.
Would it be advisable (and more efficient) to extract name and type into their own fields for querying, or shall I just whack it all in one JSON field and query that, since JSON searching is now supported?
If you are concerned about performance, then maintaining separate fields for the name and type is probably the way to go here. The reason is that if these two pieces of data exist as separate fields, it leaves open the possibility of adding indexes to those columns. While you can use MySQL's JSON API to query by name and type, it would most likely never be able to compete with an index lookup, at least not in terms of performance.
From a storage point of view, you would not pay much of a price to maintain two separate columns. The main price you would pay is that every time the JSON gets updated, you would also have to update the name and type columns.
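As a sketch, on MySQL 5.7+ you might not even have to maintain those two columns by hand: generated columns can extract name and type from the JSON and be indexed (the column and JSON path names here are assumed):
CREATE TABLE form_submission (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    form_data JSON NOT NULL,
    -- columns derived from the JSON, kept in sync by MySQL itself
    name VARCHAR(255) AS (form_data->>'$.name') STORED,
    type VARCHAR(50)  AS (form_data->>'$.type') STORED,
    INDEX idx_name (name),
    INDEX idx_type (type)
);
-- This lookup can use idx_name instead of scanning the JSON:
SELECT id FROM form_submission WHERE name = 'John';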
In database table design, which of the following is the better design for event-log type data growth?
Design 1) Numeric columns (Long) and character columns (Varchar2), with an index:
(pkey) | .. | .. | StockNumber (Long) | StockDomain (Varchar2) | ..
  ..   | .. | .. | 11111              | Finance                | ..
  ..   | .. | .. | 23458              | Medical                | ..
Design 2) A single character column (Varchar2), with an index:
(pkey) | .. | .. | StockDetails (Varchar2(1000)) | .. | ..
  ..   | .. | .. | 11111;Finance                 | .. | ..
  ..   | .. | .. | 23458;Medical                 | .. | ..
Design advantages: the first design is very specific, while the second design is more general and can accommodate more varied data. In both cases, the columns are indexed.
Storage: the first design's indexes require less storage than the second's.
Performance: Same?
My question is about performance vs. flexibility. Obviously the first design is better, but the second design is more general-purpose. Let me know your insights.
Note: Edited the question for more clarity.
In general, having discrete columns is the better way to go for a few reasons:
Datatypes - You have guarantees that the data you have saved is in the right format, at least as far as non-string columns go: your StockNumber will always be a number if it's a bigint/long, and trying to set it to anything else will cause your insert/update to error. As part of a semicolon-separated value string, there is always a chance of bad data slipping in.
Querying - Querying the single column has to be done using LIKE, since you are looking for a substring of the whole string. If I search with WHERE StockDetails LIKE '%11111%' I will find the first line, but I may also find another line where some other field inside that column happens to contain a dollar value of $11111. With discrete columns your query would be WHERE StockNumber = 11111, guaranteeing it matches only that column.
Using the data - Once you have found the row you want, you still have to read the data, which means parsing the delimited string back into separate fields. If one of those fields contains a semicolon and it is improperly escaped, the rest of the data is going to be parsed wrong. You also need your values in a guaranteed order, leaving blank sections (;;) where a column would have had a NULL value.
There is a middle ground between storing delimited strings and separate columns. I have seen, and am in fact using on one major project, data stored in a table as JSON. With JSON you have property names, so you don't care what order the fields appear in the string, because "domain" will always be "domain". Any non-standard fields you don't need in an entry (say a property that only exists for the medical domain) will simply be absent rather than needing an empty ;; placeholder. And JSON parsers exist for every language I can think of that you would connect to your database, so there is no need to hand-code something to parse your delimited string. For example, your StockDetails given above would look like this:
+--------------------------------------+
| StockDetails |
+--------------------------------------+
| {"number":11111, "domain":"Finance"} |
| {"number":23458, "domain":"Medical"} |
+--------------------------------------+
This solves issues 2 and 3 above:
You now write your query as WHERE StockDetails LIKE '%"number":11111%'; including the JSON property name guarantees you don't find the data somewhere else in your string.
You don't need to worry about fields being out of order or missing from your string and making the data unusable: JSON gives you key/value pairs, so all you need to do is handle NULLs where a key doesn't exist. This also makes it easy to add fields. Adding a new delimited field can break your parsing code, because the number of values will be off for your existing data, so you would potentially need to update all rows; but since JSON only stores non-null fields, a new field is treated like any other null value on existing data.
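As an aside, if the column holds valid JSON (or uses MySQL 5.7+'s native JSON type), you could go a step further than LIKE and query it structurally. This is not the approach described above, just an alternative sketch with an assumed table name:
-- Exact match on a JSON property instead of a substring scan
SELECT StockDetails
FROM Stock
WHERE JSON_EXTRACT(StockDetails, '$.number') = 11111;
-- Or matching the domain as an unquoted string
SELECT StockDetails
FROM Stock
WHERE JSON_UNQUOTE(JSON_EXTRACT(StockDetails, '$.domain')) = 'Medical';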
In relational database design, you need discrete columns. One value per column per row.
This is the only way to use data types and constraints to implement some data integrity. In your second design, how would you implement a UNIQUE constraint on either StockNumber or StockDomain? How would you make sure StockNumber is actually a number?
This is the only way to create indexes on each column individually, or create a compound index that puts the StockDomain first.
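For example, with discrete columns those guarantees are one-liners (a MySQL-flavored sketch; the table and index names are just placeholders):
CREATE TABLE StockEvent (
    pkey        BIGINT AUTO_INCREMENT PRIMARY KEY,
    StockNumber BIGINT NOT NULL,        -- guaranteed to be numeric
    StockDomain VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_stock_number (StockNumber),           -- impossible in design 2
    KEY idx_domain_number (StockDomain, StockNumber)    -- compound index, domain first
);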
As an analogy, look in the telephone book: can you find all people whose first name is "Bill" easily or efficiently? No, you have to search the whole book to find people with a specific first name. The order of columns in an index matters.
The second design is practically not a database at all — it's a file.
To respond to your comments, I'm reiterating what I wrote in a comment:
Sometimes denormalization is worthwhile, but I can't tell [if your second design is worthwhile], because you haven't described how you will query this data. You must take into account your query needs before you can decide on any optimization.
Stated another way: denormalization, like all other optimizations, benefits one query type, at the expense of other query types. Therefore you need to know which queries you need to be optimal, and which queries are less important, so it won't hurt your overall performance if the other queries are degraded.
If you can't predict the queries, default to designing a database with rules of normalization. Normalization is not designed for performance optimization, it's designed to prevent data anomalies, which is a good goal too.
You have posted several new comments, I guess in the hopes that I will suddenly understand and endorse your second design. But you still haven't described any specific query that will be optimized by using your second design.
I'm working on a website which should be multilingual, and some products may have more fields than others (for example, in the future a product may have an extra feature which old products don't have). Because of this, I decided to have a product table with the common fields that all products share and that are the same in all languages (like width and height), and to add another three tables for storing the extra fields, as below:
field (id,name)
field_name(field_id,lang_id,name)
field_value(product_id, field_id, lang_id, value)
By doing this I can fetch all the values from one table, but the problem is that the values can be of different types, for example a number or a text. I checked the open source project Drupal, and there they create a table for each field type and retrieve a node's data by doing joins. I want to know which way will impact performance more: having a table for each extra field type, or storing all of the values in one table and converting their type on the fly by casting?
Thank you in advance.
Yes, but no. You are storing your data in an entity-attribute-value form (EAV). This is rather inefficient in general. Here are some issues:
As you have written it, you cannot do type checking.
You cannot set-up foreign key relationships in the database.
Fetching the results for a single row requires multiple joins or a group by.
You cannot write indexes on a specific column to speed access.
There are some work-arounds. You can get around the typing issue by having separate columns for different types. So, the data structure would have:
Name
Type
ValueString
ValueInt
ValueDecimal
Or whatever types you want to support.
There are some other "tricks" if you want to go this route. The most important is to decimal align the numbers. So, instead of storing '1' and '10', you would store ' 1' and '10'. This makes the value more amenable to ordering.
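A sketch of such a typed value table, reusing the field_value table from the question (the typed value columns and the value_type column are illustrative):
CREATE TABLE field_value (
    product_id   INT NOT NULL,
    field_id     INT NOT NULL,
    lang_id      INT NOT NULL,
    value_type   VARCHAR(20) NOT NULL,    -- 'string', 'int', 'decimal', ...
    ValueString  VARCHAR(1000) NULL,
    ValueInt     BIGINT NULL,
    ValueDecimal DECIMAL(18,4) NULL,
    PRIMARY KEY (product_id, field_id, lang_id)
);
-- Exactly one Value* column is populated per row, according to value_type.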
When faced with such a problem, I often advocate a hybrid approach. This approach would have a fixed record with the important properties all nicely located in columns with appropriate types and indexes -- columns such as:
ProductReleaseDate
ProductDescription
ProductCode
And whatever values are most useful. An EAV table can then be used for the additional properties that are optional. This generally balances the power of a relational database to handle structured data with the flexibility of an EAV approach to support variable columns.
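A minimal sketch of that hybrid layout, assuming hypothetical table and column names:
-- Common, well-typed, indexable attributes live in the main table
CREATE TABLE Product (
    ProductId          INT AUTO_INCREMENT PRIMARY KEY,
    ProductCode        VARCHAR(50) NOT NULL UNIQUE,
    ProductDescription VARCHAR(500),
    ProductReleaseDate DATE,
    width              DECIMAL(10,2),
    height             DECIMAL(10,2)
);
-- Optional, language-specific extras go in the EAV side table
CREATE TABLE ProductExtraField (
    ProductId INT NOT NULL,
    FieldId   INT NOT NULL,
    LangId    INT NOT NULL,
    Value     VARCHAR(1000),
    PRIMARY KEY (ProductId, FieldId, LangId),
    FOREIGN KEY (ProductId) REFERENCES Product (ProductId)
);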
I'm building a project and I have a question about MySQL databases. The application is multi-language, and we are wondering whether we would get better performance if we split the different types of text fields (varchar, text, mediumtext) into different tables, or whether it is better to create one table with just a text field.
With the multi-language constraint in mind, I am wondering if performance will improve if I split the different types of text fields into separate tables. When you have just one table with all the texts and the language, you can search it easily (give me the text with this value in an item column and that language). When you have different tables for the different types of text, you save space in your database, because you don't need a full TEXT column for a varchar(200), but you need multiple tables to make the connection between the item, the type of text, and the languages you have for your text.
What do you think is best? Or are there possibilities that I haven't considered?
I find it better for performance reasons to keep columns with BLOB and TEXT data types in a separate table from the other data types, even if it breaks normalization.
Consider a person table with the columns name varchar, address varchar, dob date, and picture blob. A picture can easily be about 1 MB while the remaining columns may not take more than 1 KB. Imagine how many blocks of data need to be read, even if you only want to list the name and address of people living in a certain city, if you keep everything in the same table.
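A sketch of that split (the table names are assumed; MEDIUMBLOB holds up to 16 MB):
-- Frequently read, small columns
CREATE TABLE person (
    person_id INT AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100),
    address   VARCHAR(255),
    dob       DATE
);
-- The rarely read, large column, kept out of the hot table
CREATE TABLE person_picture (
    person_id INT PRIMARY KEY,
    picture   MEDIUMBLOB,
    FOREIGN KEY (person_id) REFERENCES person (person_id)
);
-- Listing names and addresses never has to read the ~1 MB pictures:
SELECT name, address FROM person WHERE address LIKE '%Springfield%';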
If you are not bound to MySQL, I would suggest using some sort of text-search engine, such as Apache Lucene, if you want to do full-text searches, because as far as I know MySQL does not provide as much full-text search performance as Lucene can.
In case you are bound to MySQL, let me try to provide some information based on the current definition of the problem (which is actually not much yet).
MySQL reference documentation states that:
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types.
So, if you run your queries using SELECT * on a table that contains a text field, you can either separate the queries that really need the text field from the ones that don't, to gain speed, or alternatively you can move the text field out of the table. Storing the text field in a secondary table costs you the extra overhead of duplicated key storage and of the indexes on that secondary table. However, depending on your database design, you may also be suffering overhead from unnecessary index updates, which can be eliminated by moving the text field to another table; but this is just a suggestion, since we don't know your schema and data access patterns.
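In practice that can look like this (a sketch with assumed table and column names):
-- Same table: list only the columns you need, so the TEXT column
-- never enters the result and an implicit temporary table can stay in memory
SELECT id, title, created_at
FROM article
ORDER BY created_at DESC
LIMIT 20;
-- Text moved to a secondary table: fetch the body only when it is needed
SELECT a.id, a.title, b.body
FROM article a
JOIN article_body b ON b.article_id = a.id
WHERE a.id = 42;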
I have a situation where I have to create tables dynamically. Depending on some criteria I am going to vary the size of the columns of a particular table.
For that purpose I need to calculate the size of one row.
e.g.
If I am going to create the following table:
CREATE TABLE sample(id int, name varchar(30));
then I need a formula that would give me the size of a single row for the table above, considering all the overhead for storing a row in a MySQL table.
Is it possible to do so, and is it feasible?
It depends on the storage engine you use and the row format chosen for that table, and also your indexes. But it is not very useful information.
Edit:
I suggest going against normalization only when you know exactly what you're doing. A DBMS is built to deal with large amounts of data. You probably don't need to serialize your structured data into a single field.
Keep in mind that your application layer then has to tokenize (or worse) the serialized field data to get the original meaning back, which certainly has more overhead than getting the data from the DB already in structured form.
The only exception I can think of is a client-heavy architecture, where moving processing to the client side actually takes the burden off the server, and you would serialize your data anyway for the sake of the transfer. In server-side code (like PHP) it is not good practice to save serialized-style data into the DB.
(Though using PHP's built-in serialization may be a good idea in some cases, your current project does not seem to benefit from it.)
VARCHAR is a variable-length data type: it has a maximum length, but the stored value can be shorter or even empty, so the calculation may not be exact. Have a look at the 'Avg_row_length' field in information_schema.tables.
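For example (AVG_ROW_LENGTH is an estimate maintained by MySQL, not an exact per-row figure; the schema name is a placeholder):
SELECT TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_NAME = 'sample';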