I learned that there is no concept of order among the tuples (i.e. rows) of a table, yet according to Wikipedia "a tuple is an ordered list of elements". Does that mean that attributes do have an order? If so, why would they be treated differently? Couldn't one add another column to a table, just as one can add another row (which is the reason tuples have no order)?
"In this notation, attribute–value pairs may appear in any order." does this mean attributes have no order?
There are two kinds of tuples, so to speak. There is "pure mathematics", where tuples are indeed typically defined as "ordered lists of values". Thus, in mathematical theories, it makes sense to speak of "the first value in a tuple" and so on. This may be the sense, or the context, that your Wikipedia article is referring to.
The Haskell language supports this kind of tuple; e.g., it has a fst function to extract the "first" value out of such a tuple.
Codd realized, however, that this would be extremely impractical when applying this mathematical concept of tuples-as-ordered-lists to the field of data management. In data management, he wanted addressability of the values by attribute name, rather than by ordinal position. Indeed, imagine the devastating consequences if "the second attribute out of five" is removed from a table, and all the programs that address "the third" and "the fourth" attribute of that same table now have to be inventoried and adapted.
So in the relational model, tuples are sets-of-named-values instead, and consequently, in the kind of tuples that play a role in the relational model of data, there is indeed not any concept of ordering of the values.
And then, as indicated in that other response, there is SQL and its blasphemous deviations from relational theory. In SQL, the ordering of attributes in tuples and headings is quite meaningful, and the consequences are all over the place. In FK declarations, correspondence of the respective referring and referred attributes is by ordinal position, not by name. Other cases are with UNIONs and EXCEPTs. Take a table T with columns X and Y of the same type.
SELECT X,Y FROM T UNION SELECT Y,X FROM T
is not invalid per se, but the standard prescribes that the column names in the result are system-defined (!). Implementations that "do the sensible thing" and deviate from this, producing a table with columns named X and Y, respectively, then confront their users with the consequence that the former expression is not identical to
SELECT Y,X FROM T UNION SELECT X,Y FROM T
(because the column ordering X,Y is a different ordering from Y,X, hence the headings are unequal, and consequently the tables are unequal.)
SELECT X,Y FROM T EXCEPT SELECT Y,X FROM T
gives results that will leave many novice SQL users scratching their heads for quite a while.
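To make the foreign-key point above concrete: in a composite FK declaration, the referring columns pair with the referred columns purely by position in the two column lists. A minimal sketch, with invented table and column names:

CREATE TABLE parent (
  a INT,
  b INT,
  PRIMARY KEY (a, b)
);

CREATE TABLE child (
  x INT,
  y INT,
  -- x pairs with a, and y with b, solely because of their ordinal
  -- positions in the two column lists; the names play no role.
  FOREIGN KEY (x, y) REFERENCES parent (a, b)
);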
The operations of the relational database model as it is normally understood and used certainly do not depend on the order of attributes in a relation. The attributes of a relation variable can always be identified by name rather than position. However, the notation used in relational database theory doesn't always specify attribute names (or attribute types) and sometimes does imply ordered attributes. This is primarily a matter of written notation rather than the structure or behaviour of the relational model. There's some more discussion of these different "named" and "ordered" perspectives in the following reference.
http://webdam.inria.fr/Alice/pdfs/Chapter-3.pdf
E.F.Codd's earliest papers on the relational model actually proposed a relational system supporting both ordered and unordered versions of relations simultaneously. That idea seems to have been forgotten or ignored ever since, even by Codd himself. It doesn't form part of modern relational database theory.
Unfortunately, SQL definitely does have the concept of column order. It is an essential part of SQL syntax and semantics and it's something that every SQL DBMS supports in order to comply with the ISO SQL standard. SQL is not relational and this is just one of the ways in which it differs from the relational model. The awkward differences between the SQL model and the relational one cause a certain amount of confusion for students of the relational model who also use SQL.
Mathematical and philosophical arguments aside, look at it practically with some real-world examples.
Suppose you write this SQL:
INSERT INTO mytable
SELECT
a1, a2, ... a99
FROM anothertable;
Suppose this works just fine, because the order in which the SELECT returns the 99 columns is the same at each run and matches the order of the columns needed for mytable.
But suppose that I am given the task of reviewing this code, written by my colleague. How do you review whether the columns are in the correct order? I will have to do some digging in other parts of the system, where mytable is constructed, to check that the columns are in the correct order. So the correctness of this SQL depends on other code, perhaps code in far-off and obscure places.
Case 2: suppose the hypothetical list of columns looks like this:
...
a63,
apt_inspecties.buurtcode_dominant,
apt_inspecties.buurtnaam_dominant,
--
apt_inspecties.buurtnaam,
apt_inspecties.buurtcode,
--
apt_inspecties.wijknaam_dominant,
apt_inspecties.wijkcode_dominant,
--
apt_inspecties.wijknaam,
apt_inspecties.wijkcode,
--
apt_inspecties.stadsdeelnaam_dominant,
apt_inspecties.stadsdeelcode_dominant,
--
apt_inspecties.stadsdeelnaam,
apt_inspecties.stadsdeelcode,
--
apt_inspecties.ggwnaam_dominant,
apt_inspecties.ggwcode_dominant,
--
apt_inspecties.ggwnaam,
apt_inspecties.ggwcode,
a80,
a81,
...
Then there will be someone, sometime, who wants to reorder the first naam/code pair so that the whole list follows a systematic naam-then-code order, eight times over, like this:
a63,
apt_inspecties.buurtnaam_dominant,
apt_inspecties.buurtcode_dominant,
But this would not be possible unless, in other code, the order of attributes is also changed. And then there is the risk that yet other code that also relies on the implicit order of attributes goes wrong.
The obvious solution is to ALWAYS practice defensive coding like:
INSERT INTO mytable(a1, ... a99)
SELECT a1...a99
FROM anothertable;
Or, put differently: assume that you cannot rely on an implicit, constant order of attributes.
Case 3:
Suppose mytable gets lost and needs to be recreated, for example from backup data. And suppose that it is recreated with another ordering of attributes. Or, someone does an ALTER TABLE to remove one attribute, and later another ALTER TABLE to add this attribute back in, thereby changing the order of attributes.
I believe everybody would agree that the resulting table is the same as the original table. You would look at the data and see that the relation between a1 and a2 (etc.) is the same, with the same number of rows.
If you do any query like
SELECT ai, aj FROM mytable;
the output will be what you would expect from the original table.
However, the above SQL would fail, because it relies on one particular implicit ordering of attributes.
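For instance, here is a hypothetical sketch of how the ordering can silently change (most SQL products append a newly added column at the end of the column list):

ALTER TABLE mytable DROP COLUMN a2;
ALTER TABLE mytable ADD COLUMN a2 INT;
-- mytable's columns are now a1, a3, ..., a99, a2, so an INSERT ... SELECT
-- that relies on the original positional order puts values in the wrong
-- columns, or fails with a type error.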
Conclusion:
For a relational database, such as Oracle, PostgreSQL, MySQL, or whatever, the ordering of rows and of attributes has no meaning. Name the attributes in your SQL to force the desired order of attributes, and use an ORDER BY clause to force the ordering of rows.
This is related to static typing in programming languages. Not everybody likes it, and some modern languages, like JavaScript and Python, do not have static typing. However, for writing larger codebases static typing is very important, which is why TypeScript seems to be supplementing JavaScript.
I have a MySQL table like this, and I want to create indexes that make all queries to the table run fast. The difficulty is that there are many possible combinations of WHERE conditions, and that the table is large (about 6M rows).
Table name: items
id: PKEY
item_id: int (the id of items)
category_1: int
category_2: int
.
.
.
category_10: int
release_date: date
sort_score: decimal
item_id is not unique, because an item can have several numbers of category_x.
An example of queries to this table is:
SELECT DISTINCT(item_id) FROM items WHERE category_1 IN (1, 2) AND category_5 IN (3, 4) AND release_date > '2019-01-01' ORDER BY sort_score
And another query maybe:
SELECT DISTINCT(item_id) FROM items WHERE category_3 IN (1, 2) AND category_4 IN (3, 4) AND category_8 IN (5) ORDER BY sort_score
If I want to optimize all the combinations of WHERE conditions, do I have to make a huge number of composite indexes of the column combinations? (like ADD INDEX idx1_3_5(category_1, category_3, category_5))
Or is it good to create 10 tables which hold the data of category_1~10, and execute many INNER JOINs in the queries?
Or, is it difficult to optimize this kind of query in MySQL, and should I use other middleware, such as Elasticsearch?
Well, the file (it is not a table) is not at all Normalised. Therefore no amount of indices on combinations of fields will help the queries.
Second, MySQL is (a) not compliant with the SQL standard, and (b) it does not have a Server Architecture or the features of one, such as Statistics, which are used by a genuine Query Optimiser, and which the commercial SQL platforms have. The "single index" issue you raise in the comments does not apply.
Therefore, while we can fix up the table, etc, you may never obtain the performance that you seek from the freeware.
Eg. in the commercial world, 6M rows is nothing; we worry when we get to a billion rows.
Eg. Statistics is automatic; we have to tweak it only when necessary: for an un-normalised table, or for billions of rows.
Or... should I use other middleware, such as Elasticsearch?
It depends on the use of genuine SQL vs MySQL, and the middleware.
If you fix up the file and make a set of Relational tables, the queries are then quite simple, and fast. It does not justify a middleware search engine (that builds a data cube on the client system).
If they are not fast on MySQL, then the first recommendation would be to get a commercial SQL platform instead of the freeware.
The last option, the very last, is to stick to the freeware and add a big fat middleware search engine to compensate.
Or is it good to create 10 tables which hold the data of category_1~10, and execute many INNER JOINs in the queries?
Yes. JOINs are quite ordinary in SQL. Contrary to popular mythology, a normalised database, which means many more tables than an un-normalised one, causes fewer JOINs, not more JOINs.
So, yes, Normalise that beast. Ten tables is the starting perception, still not at all Normalised. One table for each of the following would be a step in the direction of Normalised:
Item
Item_id will be unique.
Category
This is not category_1, etc, but each of the values that appear in category_1, etc. You must not have multiple values in a single column; that breaks 1NF. Such values will be (a) Atomic, and (b) unique. The Relational Model demands that rows are unique.
The meaning of category_1, etc in Item is not given. (If you provide some example data, I can improve the accuracy of the data model.) Obviously it is not [2].
If it is a Priority (1..10), or something similar, that the users have chosen or voted on, then this table supplies the many-to-many relationship between Item and Category, with a Priority for each row.
Let's call it Poll. The relevant Predicates would be something like:
Each Poll is 1 Item
Each Poll is 1 Priority
Each Poll is 1 Category
Likewise, sort_score is not explained. If it is even remotely what it appears to be, you will not need it, because it is a Derived Value that you should compute on the fly: once the tables are Normalised, the SQL required to compute it is straight-forward. It is not something to compute-and-store every 5 minutes or every 10 seconds.
The Relational Model
The above maintains the scope of just answering your question, without pointing out the difficulties in your file. Noting the Relational Database tag, this section deals with the Relational errors.
The Record ID field (item_id or category_id in yours) is prohibited in the Relational Model. It is a physical pointer to a record, which is explicitly the very thing that the RM overcomes, and that is required to be overcome if one wishes to obtain the benefits of the RM, such as ease of queries, and simple, straight-forward SQL code.
Conversely, the Record ID is always one additional column and one additional index, and the SQL code required for navigation becomes complex (and buggy) very quickly. You will have enough difficulty with the code as it is, I doubt you would want the added complexity.
Therefore, get rid of the Record ID fields.
The Relational Model requires that the Keys are "made up from the data". That means something from the logical row, that the users use. Usually they know precisely what identifies their data, such as a short name.
It is not manufactured by the system, such as a RecordID field which is a GUID or AUTOINCREMENT, which the user does not see. Such fields are physical pointers to records, not Keys to logical rows. Such fields are pre-Relational, pre-DBMS, 1960s Record Filing Systems, the very thing that the RM superseded. But they are heavily promoted and marketed as "relational".
Relational Data Model • Initial
(IDEF1X data model image, not reproduced here.)
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
Relational Data Model • Improved
Ternary relations (aka three-way JOINs) are known to be a problem, indicating that further Normalisation is required. Codd teaches that every ternary relation can be reduced to two binary relations.
In your case, perhaps an Item has certain, not all, Categories. The above implements Polls of Items allowing all Categories for each Item, which is a typical error in a ternary relation, and which is why it requires further Normalisation. It is also the classic error in every RFS file.
The corrected model would therefore be to establish the Categories for each Item first, as ItemCategory (your "item can have several numbers of category_x"), and then to allow Polls on that constrained ItemCategory. Note, this level of constraining data is not possible in 1960s Record Filing Systems, in which the "key" is a fabricated id field:
Each ItemCategory is 1 Item
Each ItemCategory is 1 Category
Each Poll is 1 Priority
Each Poll is 1 ItemCategory
Your indices are now simple and straight-forward; no additional indices are required.
Likewise your query code will now be simple and straight-forward, and far less prone to bugs.
Please make sure that you learn about Subqueries. The Poll table supports any type of pivoting that may be required.
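For orientation, a minimal DDL sketch of that corrected model; every name and type here is an assumption, since the meaning of the original columns was not given:

CREATE TABLE Item (
    ItemCode VARCHAR(16) NOT NULL,
    CONSTRAINT Item_PK PRIMARY KEY (ItemCode)
);

CREATE TABLE Category (
    CategoryCode VARCHAR(16) NOT NULL,
    CONSTRAINT Category_PK PRIMARY KEY (CategoryCode)
);

-- The Categories that each Item actually has.
CREATE TABLE ItemCategory (
    ItemCode     VARCHAR(16) NOT NULL,
    CategoryCode VARCHAR(16) NOT NULL,
    CONSTRAINT ItemCategory_PK PRIMARY KEY (ItemCode, CategoryCode),
    CONSTRAINT ItemCategory_Item_FK
        FOREIGN KEY (ItemCode) REFERENCES Item (ItemCode),
    CONSTRAINT ItemCategory_Category_FK
        FOREIGN KEY (CategoryCode) REFERENCES Category (CategoryCode)
);

-- Polls are constrained to the (Item, Category) pairs that exist.
CREATE TABLE Poll (
    ItemCode     VARCHAR(16) NOT NULL,
    CategoryCode VARCHAR(16) NOT NULL,
    Priority     SMALLINT    NOT NULL,
    CONSTRAINT Poll_PK PRIMARY KEY (ItemCode, CategoryCode),
    CONSTRAINT Poll_ItemCategory_FK
        FOREIGN KEY (ItemCode, CategoryCode)
        REFERENCES ItemCategory (ItemCode, CategoryCode)
);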
It is messy to optimize such queries against such a table. Moving the categories off to other tables would only make it slower.
Here's a partial solution... Identify the categories that are likely to be tested with
=
IN
a range, such as your example release_date > '2019-01-01'
Then devise a few indexes (perhaps no more than a dozen) that have, say, 3-4 columns. Those columns should be ones that are often tested together. Order the columns in the indexes based on the list above. It is quite fine to have multiple = columns (first), but don't include more than one 'range' column (last).
Keep in mind that the order of tests in WHERE does not matter, but the order of the columns in an INDEX does.
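For example, if category_1 and category_5 are often tested with = or IN, and release_date with a range, one such index might be (the column choice is purely illustrative):

ALTER TABLE items
  ADD INDEX idx_c1_c5_rd (category_1, category_5, release_date);
-- equality/IN columns first, the single range column last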
In database table design, which of the following is better design for event-log type of data growth
Design 1) Numeric columns (Long) and character columns (Varchar2), with index:

..(pkey) |..|..| StockNumber Long | StockDomain Varchar2 |...
..       |..|..| 11111            | Finance              |
..       |..|..| 23458            | Medical              |
Design 2) Character column (Varchar2), with index:

..(pkey) |..|..| StockDetails Varchar2(1000) |..|..
..       |..|..| 11111;Finance               |..|..
..       |..|..| 23458;Medical               |..|..
Design advantages: The first design is very specific; the second is more general and can accommodate more kinds of data. In both cases the columns are indexed.
Storage: The first design's indexes require less storage than the second's.
Performance: Same?
My question is about performance vs flexibility. Obviously, the first design is better, but the second is more general purpose. Let me know your insights.
In general, having discrete columns is the better way to go for a few reasons:
Datatypes - You have guarantees that the data you have saved is in the right format, at least as far as non-string columns go: your StockNumber will always be a number if it's a bigint/long, and trying to set it to anything else will cause your insert/update to error. As part of a delimiter-separated string, there is always a chance of bad data sneaking in.
Querying - Querying a single column has to be done using LIKE, since you are looking for a substring of that column's string. If I look for WHERE StockDetails LIKE '%11111%', I will find the first line, but I may also find another line where a dollar value of $11111 appears in a different field inside that column. With discrete columns your query would be WHERE StockNumber = 11111, guaranteeing it finds the data only in that column.
Using the data - Once you have found the row you want, you then have to read the data. This means parsing your delimited string into separate fields. If one of those fields contains the delimiter and it is improperly escaped, the rest of the data will be parsed wrong. You also need the values to stay in a guaranteed, fixed order, leaving empty sections (;;) where a column would have held a null value.
There is a middle ground between storing delimited strings and separate columns. I have seen, and in fact am using on one major project, data stored in a table as JSON. With JSON you have property names, so you don't care about the order in which the fields appear in the string, because domain will always be domain. Any non-standard fields you don't need in an entry (say, a property that only exists for the medical domain) will simply be absent rather than needing a blank double delimiter. And JSON parsers exist in every language I can think of that you would connect to your database, so there is no need to hand-code something to parse out your delimited string. For example, your StockDetails given above would look like this:
+--------------------------------------+
| StockDetails |
+--------------------------------------+
| {"number":11111, "domain":"Finance"} |
| {"number":23458, "domain":"Medical"} |
+--------------------------------------+
This solves issues 2 and 3 above:
You now write your query as WHERE StockDetails LIKE '%"number":11111%'; including the JSON property name guarantees you don't find the data anywhere else in your string.
You don't need to worry about fields that are out of order or missing in your string making your data unusable: JSON gives you the key/value pair, and all you need to do is handle nulls where a key doesn't exist. This also lets you add fields easily. Adding a new delimited field can break the code that parses it, since the number of values will be off for your existing data, so you would potentially need to update all rows; but since JSON only stores non-null fields, a new field is treated like any other null value on existing data.
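As an aside, many products now ship native JSON functions, so you may not even need LIKE. A hedged, MySQL-style sketch (the table name is invented; other databases have analogous functions):

SELECT *
FROM stock_events
WHERE JSON_EXTRACT(StockDetails, '$.number') = 11111;
-- matches only the "number" property, not stray 11111s elsewhere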
In relational database design, you need discrete columns. One value per column per row.
This is the only way to use data types and constraints to implement some data integrity. In your second design, how would you implement a UNIQUE constraint on either StockNumber or StockDomain? How would you make sure StockNumber is actually a number?
This is the only way to create indexes on each column individually, or create a compound index that puts the StockDomain first.
As an analogy, look in the telephone book: can you find all people whose first name is "Bill" easily or efficiently? No, you have to search the whole book to find people with a specific first name. The order of columns in an index matters.
The second design is practically not a database at all — it's a file.
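To illustrate, with discrete columns those integrity rules become one-line declarations. A sketch using Oracle-style types as in the question; the table, constraint, and index names are invented:

CREATE TABLE stock_log (
    pkey        NUMBER        PRIMARY KEY,
    StockNumber NUMBER        NOT NULL,  -- guaranteed to be a number
    StockDomain VARCHAR2(100) NOT NULL,
    CONSTRAINT stock_log_uq UNIQUE (StockNumber)
);

-- A compound index that puts StockDomain first:
CREATE INDEX stock_log_ix ON stock_log (StockDomain, StockNumber);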
To respond to your comments, I'm reiterating what I wrote in a comment:
Sometimes denormalization is worthwhile, but I can't tell [if your second design is worthwhile], because you haven't described how you will query this data. You must take into account your query needs before you can decide on any optimization.
Stated another way: denormalization, like all other optimizations, benefits one query type, at the expense of other query types. Therefore you need to know which queries you need to be optimal, and which queries are less important, so it won't hurt your overall performance if the other queries are degraded.
If you can't predict the queries, default to designing a database with rules of normalization. Normalization is not designed for performance optimization, it's designed to prevent data anomalies, which is a good goal too.
You have posted several new comments, I guess in the hopes that I will suddenly understand and endorse your second design. But you still haven't described any specific query that will be optimized by using your second design.
All popular SQL databases, that I am aware of, implement foreign keys efficiently by indexing them.
Assuming an N:1 relationship Student -> School, the school id is stored in the student table with a (sometimes optional) index. For a given student you can find their school just by looking up the school id in the row, and for a given school you can find its students by looking up the school id in the index over the foreign key in Students. Relational databases 101.
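In other words, the conventional setup is simply this (a minimal sketch with invented names):

CREATE INDEX students_school_ix ON students (school_id);

-- students of a given school are found via the index:
SELECT * FROM students WHERE school_id = 42;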
But is that the only sensible implementation? Imagine you are the database implementer, and instead of using a btree index on the foreign key column, you add an (invisible to the user) set on the row at the other (many) end of the relation. So instead of indexing the school id column in students, you would have an invisible column that is a set of student ids on the school row itself. Then fetching the students for a given school is as simple as iterating the set. Is there a reason this implementation is uncommon? Are there some queries that can't be supported efficiently this way? The two approaches seem more or less equivalent, modulo particular implementation details. It seems to me you could emulate either solution with the other.
In my opinion it's conceptually the same as splitting of the btree, which contains sorted runs of (school_id, student_row_id), and storing each run on the school row itself. Looking up a school id in the school primary key gives you the run of student ids, the same as looking up a school id in the foreign key index would have.
You seem to be suggesting storing a "comma separated list of values" as a string in a character column of a table, and you say that it's "as simple as iterating the set".
But in a relational database, it turns out that "iterating the set" when it's stored as a list of values in a column is not at all simple. Nor is it efficient. Nor does it conform to the relational model.
Consider the operations required when a member needs to be added to a set, or removed from the set, or even just determining whether a member is in a set. Consider the operations that would be required to enforce integrity, to verify that every member in that "comma separated list" is valid. The relational database engine is not going to help us out with that, we'll have to code all of that ourselves.
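For instance, merely testing whether a member is in such a stored list tends to degenerate into string hacks like the following sketch (standard SQL, where || is string concatenation; MySQL would need CONCAT; all names are hypothetical):

SELECT school_name
FROM schools
-- glue delimiters on both sides so that 42 does not match 142 or 421
WHERE ',' || student_ids || ',' LIKE '%,42,%';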
At first blush, this idea may seem like a good approach. And it's entirely possible to do, and to get some code working. But once we move beyond the trivial demonstration, into the realm of real problems and real world data volumes, it turns out to be a really, really bad idea.
Storing comma-separated lists is an all-too-familiar SQL anti-pattern.
I strongly recommend Chapter 2 of Bill Karwin's excellent book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming ISBN-13: 978-1934356555
(The discussion here relates to "relational database" and how it is designed to operate, following the relational model, the theory developed by Ted Codd and Chris Date.)
"All nonkey columns are dependent on the key, the whole key, and nothing but the key. So help me Codd."
Q: Is there a reason this implementation is uncommon?
Yes, it's uncommon because it flies in the face of relational theory. And it makes what would be a straightforward problem (for the relational model) into a confusing jumble that the relational database can't help us with. If what we're storing is just a string of characters, and the database never needs to do anything with that, other than store the string and retrieve the string, we'd be good. But we can't ask the database to decipher that as representing relationships between entities.
Q: Are there some queries that can't be supported efficiently this way?
Any query that would need to turn that "list of values" into a set of rows to be returned would be inefficient. Any query that would need to identify a "list of values" containing a particular value would be inefficient. And operations to insert or remove a value from the "list of values" would be inefficient.
This might buy you some small benefit in a narrow set of cases. But the drawbacks are numerous.
Such indices are useful for more than just direct joins from the parent record. A query might GROUP BY the FK column, or join it to a temp table / subquery / CTE; all of these cases might benefit from the presence of an index, but none of the queries involve the parent table.
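For example, reusing the hypothetical student/school schema from the question:

-- Counts students per school; an index on students(school_id) can satisfy
-- this with an index scan, and the schools table is never touched.
SELECT school_id, COUNT(*)
FROM students
GROUP BY school_id;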
Even direct joins from the parent often involve additional constraints on the child table. Consequently, indices defined on child tables commonly include other fields in addition to the key itself.
Even if there appear to be fewer steps involved in this algorithm, that does not necessarily equate to better performance. Databases don't read from disk a column at a time; they typically load data in fixed-size blocks. As a result, storing this information in a contiguous structure may allow it to be accessed far more efficiently than scattering it across multiple tuples.
No database that I'm aware of can inline an arbitrarily large column; either you'd have a hard limit of a few thousand children, or you'd have to push this list to some out-of-line storage (and with this extra level of indirection, you've probably lost any benefit over an index lookup).
Databases are not designed for partial reads or in-place edits of a column value. You would need to fetch the entire list whenever it's accessed, and more importantly, replace the entire list whenever it's modified.
In fact, you'd need to duplicate the entire row whenever the child list changes; the MVCC model handles concurrent modifications by maintaining multiple versions of a record. And not only are you spawning more versions of the record, but each version holds its own copy of the child list.
Probably most damning is the fact that an insert on the child table now triggers an update of the parent. This involves locking the parent record, meaning that concurrent child inserts or deletes are no longer allowed.
I could go on. There might be mitigating factors or obvious solutions in many of these cases (not to mention outright misconceptions on my part), though there are probably just as many issues that I've overlooked. In any case, I'm satisfied that they've thought this through fairly well...
Any two tuples in a relation have different values on at least one attribute.
Tuples belonging to the same relation are stored in arbitrary order.
The programmer can specify the tuples in the same relation to be displayed in a particular order according to the values of one or more attributes.
I would say... True, False, True....
I would like to hear your guys opinions because you people have been in this field longer than I have :)
Thank you guys in advance :)
I don't know of any RDBMS that forces the developer to have non-duplicate rows (tuples). So, in practice, the answers are False, True*, and True.
*Again, in practice, the answer to #2 might be False in some circumstances depending on the RDBMS, but it doesn't have to be and the developer should assume that #2 is True.
True. Two rows that are completely identical must represent, logically, the same tuple. A tuple in a relation (or a row in a table) is basically a statement of truth, e.g. the row ('John Smith',1/1/2000) might mean "The employee named John Smith started employment on 1/1/2000." If a table in an RDBMS has the rows ('John Smith',1/1/2000) and ('John Smith',1/1/2000), this doesn't make the statement any more true - there is only one tuple represented here.* The relational model doesn't say you can't store multiple copies of a row; it only says that if you change one copy, you must make the same change to all the other copies as well; and when you query it, only one of those copies should be used. In practice, it's more convenient/performant to enforce uniqueness with a constraint and only store one physical copy.
True (or more accurately: not applicable). The relational model cares not in which order truth statements are stored. The order in which tuples are represented is immaterial to their truth value.
Also "not applicable". How results of queries on a relational database are presented to users is outside the scope of the relational model. It really couldn't care less.
(* don't tell me there might be two people named "John Smith" who started work on that particular date; in the relational model, you must find something different about those two poor fellows to distinguish them to the database, otherwise, logically, those rows must be referring to exactly the same truth statement - if not, there are update and delete anomalies that cannot be resolved - e.g. if "John Smith" suddenly gets sick of being mistaken for that other John Smith, and resigns, how would you write the UPDATE statement to set his resignation date?)
Any two tuples in a relation have different values on at least one attribute.
True. This is a requirement of the relational model.
Tuples belonging to the same relation are stored in arbitrary order.
Neither true nor false.
It is true that relations have no tuple ordering by definition. However, a relation is not 'stored'. The quiz master probably meant to say relation variable (relvar).
It is true that the relational model has nothing to say about whether a relvar's tuples are stored in any particular order; rather, this is a feature of the DBMS. Most SQL products that are based on contiguous storage allow users to specify a clustered index for a base table, which will influence physical ordering on disk. However, the SQL standard has nothing to say about such indexes. There is no reason why a relational DBMS couldn't have a similar feature.
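For example, a sketch in SQL Server syntax (the names are invented):

CREATE CLUSTERED INDEX students_name_cix ON students (last_name);

This keeps the rows physically ordered by last_name on disk, which affects storage and performance only; it has no bearing on the relational semantics.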
The programmer can specify the tuples in the same relation to be displayed in a particular order according to the values of one or more attributes.
The statement is not true of a relational database language. If a database language had such a feature then the result would not be a relation (it wouldn't be displaying tuples) and therefore the language in question would not be relational.
I'm creating an online dictionary and I have to use three different dictionaries for this purpose: everyday terms, chemical terms, computer terms. I have three options:
1) Create three different tables, one table for each dictionary
2) Create one table with extra columns, i.e.:
id | term  | dic_1_definition | dic_2_definition | dic_3_definition
---|-------|------------------|------------------|-----------------
 1 | term1 | definition       |                  |
 2 | term2 |                  | definition       |
 3 | term3 |                  |                  | definition
 4 | term4 |                  | definition       |
 5 | term5 | definition       | definition       |
etc.
3) Create one table with an extra "tag" column and tag all my terms depending on its dictionary, i.e.:
id term definition tag
------------------------------------
1 term1 definition dic_1
2 term2 definition dic_2
3 term3 definition dic_3
4 term4 definition dic_2
5 term1 definition dic_2
etc.
A term can be related to one or more dictionaries, but have different definitions; say, a term in everyday use can differ from the same term in the IT field. That's why term1 (in my last table) can be assigned two tags - dic_1 (id 1) and dic_2 (id 5).
In the future I'll add more dictionaries, so there will probably be more than three. I think if I use option 2 (with extra columns) I'll end up with one table and many, many columns. I don't know if that's bad for performance or not.
Which option is the best approach in my case? Which one is faster? Why? Any suggestions and other options are greatly appreciated.
Thank you.
2) Create one table with extra columns
You definitely shouldn't be using the 2nd approach. What if in the future you decide that you want 10 dictionaries? You would have to create an additional 10 columns, which is madness.
What you should do is create a single table for all your dictionaries, a single table for all your terms, and a single table for all your definitions; that way all your data is grouped together in a logical fashion.
Then you can create a unique ID for each of your dictionaries, which is referenced in the terms table. Then all you need is a simple query to obtain the terms for a particular dictionary.
I think you should have a lookup table for your dictionary types:
DictionaryType(DTId, DTName)
Have another table for your terms:
Terms(TermID, TermName)
Then your definitions:
Definitions(DefinitionId, TermID, Definition, DTId)
This should work.
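For example, fetching every term and definition of one dictionary then becomes a simple join (a sketch against the tables above; the dictionary name is invented):

SELECT t.TermName, d.Definition
FROM Definitions d
JOIN Terms t ON t.TermID = d.TermID
JOIN DictionaryType dt ON dt.DTId = d.DTId
WHERE dt.DTName = 'everyday';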
Option 3 sounds like the most appropriate choice for your scenario. It makes the queries a little simpler and is definitely more maintainable in the long run.
Option 2 is definitely not the way to go because you will end up with a lot of null values and writing queries against such a table will be a nightmare.
Option 1 is not bad, but before your application can query, it has to decide which table to query against, and that could be a problem.
So option 3 would result in simple queries like:
SELECT term, definition FROM table WHERE tag = 'dic_1'
You may even create another tag table to keep info about the tags themselves.
I have developed a similar project, and my design was as follows. Storing words, definitions and dictionaries in different tables is a flexible choice, especially when you will add new dictionaries in the future.
(Data model diagram: http://img300.imageshack.us/img300/6550/worddict.png)
Data normalization: I would go with option 3; then you don't have to do any fancy queries to identify how many definitions are applicable for a given term.
There's always an "it depends..."
Having said that, option 2 will usually be a bad choice, both from the purist perspective (data normalisation) and the practical one: you have to alter the table definition to add a new dictionary (or remove an old one).
If your main access is always going to be looking for a matching term, and the dictionary name ('everyday', 'chemical', 'geek') is an attribute, then option 3 makes sense.
If on the other hand your access is always primarily by dictionary type as well as term, and dictionary 1 is huge but rarely used, while dictionaries 2..n are small but commonly used, then option 1 might make more sense (or option 1a => one table for rarely used dictionaries, another for heavily used dictionaries)... but this is a very hypothetical case!
Your database structure should contain data, the structure itself should not be data. This rules out option 2 immediately, unless you create the different tables in order to build separate applications running on the different dictionaries. If they are being shared, then it is the wrong way to do it.
Option 1 requires a database modification and queries to be rewritten in order to accommodate addition of new dictionaries. It also adds excessive complication to simple queries, such as "what dictionaries are this word in?"
Option 3 is the most flexible and the best choice here. If your data grows too large, you can eventually use DB-side features like table partitioning to speed things up.
You want to fetch data based on the dictionary type, that means that the dictionary type is data.
Data should be in the fields of the tables, not as table names or field names. If you don't have the data in the fields, you have a data model that needs changes if the data chances, and you need to create queries dynamically to get the data.
The first option uses the dictionary type as table names.
The second option uses the dictionary type as field names.
The third option correctly places the dictionary type as data in a field.
However, the term and the tag should not be strings; they should rather be foreign keys to tables where the terms and dictionary types are defined.
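A sketch of that shape in generic SQL; all names and types are assumptions:

CREATE TABLE dictionary (
    dic_id   INT         NOT NULL PRIMARY KEY,
    dic_name VARCHAR(50) NOT NULL
);

CREATE TABLE term (
    term_id INT          NOT NULL PRIMARY KEY,
    term    VARCHAR(100) NOT NULL
);

CREATE TABLE definition (
    term_id    INT  NOT NULL,
    dic_id     INT  NOT NULL,
    definition TEXT NOT NULL,
    PRIMARY KEY (term_id, dic_id),
    FOREIGN KEY (term_id) REFERENCES term (term_id),
    FOREIGN KEY (dic_id)  REFERENCES dictionary (dic_id)
);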
The requirements here are far too vague, resulting in the 'accepted answer' being totally over-'solved'. The requirements need to provide more information about how the dictionaries will be used.
That said, working off the little provided, I'd go with a variation on #3.
Number 1 is perfectly viable if the dictionaries will be used entirely independently, and the only reason the concept of shared terms was mentioned is that it just happens to be a coincidental possibility.
Ditch 2; it unnecessarily leads to NULL values in columns, and DB designs don't like that.
Number 3 is the best, but ditch the artificial key and key on Term + Tag, because the artificial key creates the possibility of duplicate entries (by Term + Tag). If no other tables reference TermDefinitions, the key is a waste; if something does reference it, then that something says (for example) "I'm referencing TermDefinition #3... uhhm, whatever that is. :S"
In a nutshell, nothing provided so far in the requirement indicates any need for anything more complicated than option 3.