Single Column vs Multi Column Design (for Non-Primary-Key Columns) - MySQL

In database table design, which of the following is the better design for event-log style data growth?
Design 1) A numeric column (Long) and a character column (Varchar2), with an index:
..(pkey)|..|..|StockNumber Long | StockDomain Varchar2 |...
.. |..|..|11111 | Finance
.. |..|..|23458 | Medical
Design 2) A single character column (Varchar2) with an index:
..(pkey)|..|..|StockDetails Varchar2(1000) |..|..
.. |..|..|11111;Finance |..|..
.. |..|..|23458;Medical |..|..
Design advantages: The first design is very specific, while the second is more general and can accommodate more varied data. In both cases the columns are indexed.
Storage: The first design's indexes require less storage than the second's.
Performance: Same?
My question is about performance vs flexibility. Obviously the first design is better, but the second is the more general purpose one. Let me know your insights.
Note: Edited the question for more clarity.

In general, having discrete columns is the better way to go for a few reasons:
Datatypes - You have guarantees that the data you have saved is in the right format, at least as far as non-string columns go: your StockNumber will always be a number if it's a bigint/long, and trying to set it to anything else will cause your insert/update to error. As part of a delimiter-separated string there is always a chance of bad data sneaking in.
Querying - Querying a single column has to be done using LIKE, since you are looking for a substring of one big string. If I search WHERE StockDetails LIKE '%11111%' I will find the first row, but I may also find another row where a dollar value in a different field happens to be $11111. With discrete columns your query would be WHERE StockNumber = 11111, which is guaranteed to match only that column.
Using the data - Once you have found the row you want, you then have to read the data, which means parsing the delimited string back into separate fields. If one of those fields contains the delimiter and it is not properly escaped, the rest of the data will be parsed wrong. You also need the values to stay in a guaranteed order, leaving empty sections (;;) where a column would have been NULL. (A sketch of the discrete-column design follows below.)
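To make the discrete-column design concrete, here is a minimal sketch; the table name, exact types, and index names are illustrative assumptions, not something from the question:
CREATE TABLE stock_events (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stock_number BIGINT          NOT NULL,  -- a numeric type rejects non-numeric input (under strict SQL mode)
    stock_domain VARCHAR(50)     NOT NULL,
    INDEX idx_stock_number (stock_number),
    INDEX idx_stock_domain (stock_domain)
);
-- Exact match on a typed, indexed column; no substring scan needed:
SELECT * FROM stock_events WHERE stock_number = 11111;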
There is a middle ground between storing delimited strings and separate columns. I have seen, and in fact am using on one major project, data stored in a table as JSON. With JSON you have property names, so you don't care about the order the fields appear in the string, because domain will always be domain. Any non-standard fields that an entry doesn't need (say, a property that only exists for the medical domain) will simply not be there, rather than needing an empty ;; placeholder. And JSON parsers exist in every language I can think of that you would connect to your database, so there is no need to hand-code something to parse a delimited string. For example, your StockDetails given above would look like this:
+--------------------------------------+
| StockDetails |
+--------------------------------------+
| {"number":11111, "domain":"Finance"} |
| {"number":23458, "domain":"Medical"} |
+--------------------------------------+
This solves issues 2 and 3 above:
You now write your query as WHERE StockDetails LIKE '%"number":11111%' - including the JSON property name guarantees you don't match the value anywhere else in the string.
You don't need to worry about fields being out of order or missing from the string making your data unusable: JSON gives you key/value pairs, so all you need to do is handle NULLs where a key doesn't exist. This also lets you add fields easily. Adding a new delimited field can break your parsing code, because the number of values will be off for existing data, so you would potentially need to update all rows; with JSON you only store non-null fields, so a new field is simply treated like any other null value on existing data.
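As a side note: the LIKE approach above works on a plain VARCHAR column. If you are on MySQL 5.7.13 or later you could instead use the native JSON type and its extraction operator rather than pattern matching; this is a rough sketch, with illustrative table and column names:
CREATE TABLE stock_events_json (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    stock_details JSON            NOT NULL
);
INSERT INTO stock_events_json (stock_details)
VALUES ('{"number": 11111, "domain": "Finance"}'),
       ('{"number": 23458, "domain": "Medical"}');
-- ->> extracts and unquotes the value, so it compares as text:
SELECT * FROM stock_events_json
WHERE stock_details->>'$.number' = '11111';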

In relational database design, you need discrete columns. One value per column per row.
This is the only way to use data types and constraints to implement some data integrity. In your second design, how would you implement a UNIQUE constraint on either StockNumber or StockDomain? How would you make sure StockNumber is actually a number?
This is the only way to create indexes on each column individually, or create a compound index that puts the StockDomain first.
As an analogy, look in the telephone book: can you find all people whose first name is "Bill" easily or efficiently? No, you have to search the whole book to find people with a specific first name. The order of columns in an index matters.
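To make the index-order point concrete, here is a hedged sketch using the question's column names on an illustrative table:
-- Individual indexes, one per column:
CREATE INDEX idx_stock_number ON stock_events (stock_number);
CREATE INDEX idx_stock_domain ON stock_events (stock_domain);
-- A compound index with StockDomain first can serve
--   WHERE stock_domain = 'Finance'
--   WHERE stock_domain = 'Finance' AND stock_number = 11111
-- but, like the phone book, it cannot efficiently serve a lookup by stock_number alone:
CREATE INDEX idx_domain_number ON stock_events (stock_domain, stock_number);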
The second design is practically not a database at all — it's a file.
To respond to your comments, I'm reiterating what I wrote in a comment:
Sometimes denormalization is worthwhile, but I can't tell [if your second design is worthwhile], because you haven't described how you will query this data. You must take into account your query needs before you can decide on any optimization.
Stated another way: denormalization, like all other optimizations, benefits one query type, at the expense of other query types. Therefore you need to know which queries you need to be optimal, and which queries are less important, so it won't hurt your overall performance if the other queries are degraded.
If you can't predict the queries, default to designing a database with rules of normalization. Normalization is not designed for performance optimization, it's designed to prevent data anomalies, which is a good goal too.
You have posted several new comments, I guess in the hopes that I will suddenly understand and endorse your second design. But you still haven't described any specific query that will be optimized by using your second design.

Does storing int value as varchar in mysql affect performance heavily?

I'm working on a website which should be multilingual, and some products may have more fields than others (for example, in the future a product might have an extra feature which older products don't have). Because of this, I decided to have a product table with the common fields that all products share and that are the same in all languages (like width and height), and to add another three tables for storing the extra fields, as below:
field (id,name)
field_name(field_id,lang_id,name)
field_value(product_id, field_id, lang_id, value)
By doing this I can fetch all the values from one table, but the problem is that the values can be of different types - for example, a number or a text. I checked the open-source project Drupal, and there they create a table for each field type and retrieve a node's data by doing joins. I want to know which approach will impact performance more: having a table for each extra field type, or storing all the values in one table and converting their type on the fly by casting?
Thank you in advance.
Yes, but no. You are storing your data in an entity-attribute-value form (EAV). This is rather inefficient in general. Here are some issues:
As you have written it, you cannot do type checking.
You cannot set-up foreign key relationships in the database.
Fetching the results for a single row requires multiple joins or a group by.
You cannot write indexes on a specific column to speed access.
There are some work-arounds. You can get around the typing issue by having separate columns for different types. So, the data structure would have:
Name
Type
ValueString
ValueInt
ValueDecimal
Or whatever types you want to support.
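A minimal sketch of such a value table, reusing the field_value table from the question; the typed columns and their exact sizes are assumptions:
CREATE TABLE field_value (
    product_id    INT           NOT NULL,
    field_id      INT           NOT NULL,
    lang_id       INT           NOT NULL,
    value_type    ENUM('string','int','decimal') NOT NULL,
    value_string  VARCHAR(255)  NULL,
    value_int     INT           NULL,
    value_decimal DECIMAL(12,4) NULL,
    PRIMARY KEY (product_id, field_id, lang_id)
);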
There are some other "tricks" if you want to go this route. The most important is to decimal-align the numbers. So, instead of storing '1' and '10', you would store ' 1' and '10' (left-padded to a fixed width). This makes the values sort correctly as strings.
When faced with such a problem, I often advocate a hybrid approach. This approach would have a fixed record with the important properties all nicely located in columns with appropriate types and indexes -- columns such as:
ProductReleaseDate
ProductDescription
ProductCode
And whatever values are most useful. An EAV table can then be used for additional properties that are optional. This generally balances the power of the relational database to handle structured data along with the flexibility of an EAV approach to support variable columns.
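A rough sketch of the hybrid approach (table, column, and index names are illustrative assumptions): the stable, frequently queried properties get real typed columns and indexes, while anything optional goes to the EAV table sketched above.
CREATE TABLE product (
    product_id           INT         NOT NULL AUTO_INCREMENT PRIMARY KEY,
    product_code         VARCHAR(30) NOT NULL,
    product_release_date DATE        NULL,
    product_description  TEXT        NULL,
    INDEX idx_product_code (product_code)
);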

MySQL table design: large column size vs large number of rows

Please help me understand which of the following is better for scaling and performance.
Table: test
columns: id <int, primary key>, doc <int>, keyword <string>
The data I want to store is a pointer to the documents containing a particular keyword.
Design 1:
have a unique constraint on the keyword column and store the list of documents as an array
e.g. id: 1, doc: [4,5,6], keyword: google
Design 2:
insert a row for each document
1 | 4 | google
2 | 5 | google
3 | 6 | google
Let's say the average number of documents a particular keyword is found in is close to 100,000. There may not be a maximum number of documents a keyword appears in.
You can forget about option 1 because there's no array data type in MySQL.
To be honest, if you want a scalable solution for this type of data, I think you should look into a different type of database. Research NoSQL and key-value store databases.
With MySQL, the best I can think of is your 2nd option, with the exception that you should create another table with a numeric ID and the unique keywords. That way, when you do your search, you first look up the ID and then filter the big table by that ID instead of by string. Numeric comparison is faster than string comparison.
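A hedged sketch of that layout (table and column names are illustrative assumptions):
CREATE TABLE keyword (
    keyword_id INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
    keyword    VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_keyword (keyword)
);
CREATE TABLE keyword_doc (
    keyword_id INT NOT NULL,
    doc_id     INT NOT NULL,
    PRIMARY KEY (keyword_id, doc_id)
);
-- Resolve the string once, then filter the big table by the numeric id:
SELECT kd.doc_id
FROM keyword AS k
JOIN keyword_doc AS kd ON kd.keyword_id = k.keyword_id
WHERE k.keyword = 'google';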
A lot of factors come into scaling and performance so it's not usually a good idea to try to optimise unknowns early in development.
For database design I find it's usually best to go with the more correct normalised approach (your design 2) and then worry about the scaling and performance if it becomes an issue. You can then de-normalise certain areas or take other approaches depending on what issues you face.
Your design option 1 is likely to hit other issues more immediately, with the inability to join the doc column to another table, as well as the complexity of updating and searching it.
Design 1 is potentially limited by MySQL's row size limit.
Design 2 makes the most sense to me. What if you need to remove one of those values? You just delete a row rather than having to search through and update an array. It's also nice because it allows you to limit the size of your results if necessary (e.g., for pagination).
You might also consider creating a many-to-many relationship between this table and a keywords table instead of storing keywords as a field here.

MySQL: One row or more?

I have content of various kinds, each item with an ID, and I can assign multiple types to a piece of content.
The question is: should I use multiple rows to store the multiple types, or use a single type field, put the types in it separated by commas, and parse them in PHP?
Multiple Rows
`content_id` | `type`
1 | 1
1 | 2
1 | 3
VS
Single Row
`content_id` | `type`
1 | 1,2,3
EDIT
I'm looking for the faster answer, not the easier one - please consider this. Performance is really important to me, and I'm talking about a really huge database with millions or tens of millions of rows.
I'd generally always recommend the "multiple rows" approach as it has several advantages:
You can use SQL to query, for example, WHERE type=3 without any great difficulty, rather than having to use WHERE type LIKE '%3%', which is less efficient
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when they are stored as a comma-separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is allow you to store the data, and obtain subsets of the data.
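As a rough sketch of the multiple-rows layout (table and index names are illustrative assumptions):
CREATE TABLE content_type (
    content_id INT NOT NULL,
    type       INT NOT NULL,
    PRIMARY KEY (content_id, type),
    INDEX idx_type (type)
);
-- Index-backed exact match instead of LIKE '%3%':
SELECT content_id FROM content_type WHERE type = 3;
-- Adding or removing a pair is a single-row statement:
INSERT INTO content_type (content_id, type) VALUES (1, 4);
DELETE FROM content_type WHERE content_id = 1 AND type = 2;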
I always use multiple rows. If you use single rows your data is hard to read and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs de-normalised design.
Usually I would recommend sticking to the "multiple rows" style (normalised).
Although sometimes (for performance/storage reasons) people deliberately implement the "single row" style.
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; reporting tends to be easier with some denormalization, and that is the main example. So if your code is cleaner or performs better with the single row, then go for that. Otherwise, multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens if there are more than 11 types? The above query would find "12". Okay, you could say "where type like '%,2,%'". Except that doesn't work if 2 is the first or the last in the list. Even if you came up with a way to do it reliably, a LIKE search with an initial % means a sequential read of every record in the table, which is very slow.
How big will you make the crammed field? What if the string of types grows too big to fit within its maximum length?
Do you carry any data about the types? If you create a second table with a key of "type" and, say, a description of that type, how will you join to that table? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you keep the field consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5"? Or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" will not look equal even though they are really equivalent. In either case, updating a type field now becomes a program rather than a single SQL statement.
If there is some trivial performance gain, it's just not worth it.

MySQL database structure: more columns or more rows?

I'm creating an online dictionary and I have to use three different dictionaries for this purpose: everyday terms, chemical terms, and computer terms. I have three options:
1) Create three different tables, one table for each dictionary
2) Create one table with extra columns, i.e.:
id term dic_1_definition dic_2_definition dic_3_definition
----------------------------------------------------------------------
1 term1 definition
----------------------------------------------------------------------
2 term2 definition
----------------------------------------------------------------------
3 term3 definition
----------------------------------------------------------------------
4 term4 definition
----------------------------------------------------------------------
5 term5 definition definition
----------------------------------------------------------------------
etc.
3) Create one table with an extra "tag" column and tag all my terms according to their dictionary, i.e.:
id term definition tag
------------------------------------
1 term1 definition dic_1
2 term2 definition dic_2
3 term3 definition dic_3
4 term4 definition dic_2
5 term1 definition dic_2
etc.
A term can be related to one or more dictionaries but have different definitions; let's say a term in everyday use can differ from the same term in the IT field. That's why term1 (in my last table) can be assigned two tags - dic_1 (id 1) and dic_2 (id 5).
In the future I'll add more dictionaries, so there will probably be more than three. I think if I use option 2 (with extra columns) I'll end up with one table and many, many columns. I don't know whether that's bad for performance or not.
Which option is the best approach in my case? Which one is faster? Why? Any suggestions and other options are greatly appreciated.
Thank you.
2) Create one table with extra columns
You definitely shouldn't use the 2nd approach. What if in the future you decide that you want 10 dictionaries? You would have to create an additional 10 columns, which is madness.
What you should do is create a single table for all your dictionaries, a single table for all your terms, and a single table for all your definitions; that way all your data is grouped together in a logical fashion.
Then you can create a unique ID for each of your dictionaries, which is referenced in the terms table. Then all you need is a simple query to obtain the terms for a particular dictionary.
I think you should have a lookup table for your dictionary types:
DictionaryType(DTId, DTName)
Have another table for your terms:
Terms(TermID, TermName)
Then your definitions:
Definitions(DefinitionId, TermID, Definition, DTId)
This should work.
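A minimal sketch of that layout in MySQL; the column types and foreign keys are assumptions beyond what the answer specifies:
CREATE TABLE DictionaryType (
    DTId   INT         NOT NULL AUTO_INCREMENT PRIMARY KEY,
    DTName VARCHAR(50) NOT NULL
);
CREATE TABLE Terms (
    TermID   INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
    TermName VARCHAR(100) NOT NULL
);
CREATE TABLE Definitions (
    DefinitionId INT  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    TermID       INT  NOT NULL,
    Definition   TEXT NOT NULL,
    DTId         INT  NOT NULL,
    FOREIGN KEY (TermID) REFERENCES Terms (TermID),
    FOREIGN KEY (DTId)   REFERENCES DictionaryType (DTId)
);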
Option 3 sounds like the most appropriate choice for your scenario. It makes the queries a little simpler and is definitely more maintainable in the long run.
Option 2 is definitely not the way to go, because you will end up with a lot of NULL values and writing queries against such a table will be a nightmare.
Option 1 is not bad, but before your application can query it has to decide which table to query against, and that could be a problem.
So option 3 would result in simple queries like:
Select term, definition from table where tag = 'dic_1'
You may even create another tag table to keep info about the tags themselves.
I have developed a similar project and my design was as follows. Storing words, definitions and dictionaries in different tables is a flexible choice, especially when you will add new dictionaries in the future.
(Schema diagram: http://img300.imageshack.us/img300/6550/worddict.png)
Data normalization... I would go with 3; then you don't have to do any fancy queries to work out how many definitions apply to a given term.
There's always an "it depends..."
Having said that, option 2 will usually be a bad choice - both from the purist perspective (data normalisation) and the practical one: you have to alter the table definition to add a new dictionary (or remove an old one).
If your main access is always going to be looking for a matching term, and the dictionary name ('everyday', 'chemical', 'geek') is an attribute, then option 3 makes sense.
If, on the other hand, your access is always primarily by dictionary type as well as term, and dictionary 1 is huge but rarely used while dictionaries 2..n are small but commonly used, then option 1 might make more sense (or option 1a => one table for rarely used dictionaries, another for heavily used ones)... but this is a very hypothetical case!
Your database structure should contain data, the structure itself should not be data. This rules out option 2 immediately, unless you create the different tables in order to build separate applications running on the different dictionaries. If they are being shared, then it is the wrong way to do it.
Option 1 requires a database modification and queries to be rewritten in order to accommodate the addition of new dictionaries. It also adds excessive complication to simple queries, such as "which dictionaries is this word in?"
Option 3 is the most flexible and the best choice here. If your data grows too large, you can eventually use database-side features like table partitioning to speed things up.
You want to fetch data based on the dictionary type, that means that the dictionary type is data.
Data should be in the fields of the tables, not as table names or field names. If you don't have the data in the fields, you have a data model that needs changes if the data changes, and you need to create queries dynamically to get the data.
The first option uses the dictionary type as table names.
The second option uses the dictionary type as field names.
The third option correctly places the dictionary type as data in a field.
However, the term and the tag should not be strings, they should rather be foreign keys to tables where the terms and dictionary types are defined.
The requirements here are far too vague, resulting in the 'accepted answer' being totally over-'solved'. The requirements need to provide more information about how the dictionaries will be used.
That said, working off the little provided; I'd go with a variation on #3.
Number 1 is perfectly viable if the dictionaries will be used entirely independently, and the only reason the concept of shared terms was mentioned is that it just happens to be a coincidental possibility.
Ditch 2; it unnecessarily leads to NULL values in columns, and DB designs don't like that.
Number 3 is the best, but ditch the artificial key and key on Term + Tag; the artificial key does little except create the possibility of duplicate entries (by Term + Tag). If no other table references TermDefinitions, the key is a waste; if something does, then it ends up saying (for example) "I'm referencing TermDefinition #3... Uhhm, whatever that is. :S"
In a nutshell, nothing provided so far in the requirement indicates any need for anything more complicated than option 3.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001 - how does the fast lookup work using a hash table?
Isn't it the same as using "SELECT * from .." in MySQL? I have read a lot, and people say "SELECT *" searches from beginning to end, but a hash table doesn't. Why, and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how to insert into and retrieve from a hash table in MySQL query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key - the thing we need in order to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can be any container type, or a linked-list class that you write by hand. When we execute a Find within that list, it works the slow way (examining the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
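To see this from the SQL side, here is a rough sketch (table, column, and index names are illustrative assumptions): an indexed lookup lets the engine jump to the matching rows instead of scanning the whole table, and the MEMORY engine even lets you ask for a hash index explicitly.
-- InnoDB indexes are B-trees; an indexed equality lookup avoids a full table scan:
CREATE INDEX idx_name ON person (name);
EXPLAIN SELECT id, name FROM person WHERE name = 'Alex';
-- The MEMORY engine supports explicit hash indexes:
CREATE TABLE person_mem (
    id   INT         NOT NULL,
    name VARCHAR(50) NOT NULL,
    INDEX idx_name_hash (name) USING HASH
) ENGINE = MEMORY;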
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since the row will always get the same id from the input you build the hash value from, you can always recreate that id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To deal with this you need a deterministic strategy for the case where two inputs produce the same id. Check Wikipedia for more info; this is called collision handling and is covered in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. That means approximately one comparison, with the disclaimer of the special handling I just mentioned.