Modeling a hierarchy/directory in a relational database (MySQL)

I would like to model a hierarchy/directory like the one below in a MySQL table. The table schema I have in mind is shown below as well.
However, the directory I'm talking about would comprise about 100,000 elements, with a depth of roughly 5-10 levels. Moreover, we will have a pool of tags, and each element of the directory may be linked to one or more tags. So I was wondering whether there is a better approach. I have read that some people design tables that are not canonical for the sake of high performance, and I am evaluating that option too.
PS: some people use multi-way trees to model this at the programming-language level, so the question of how this ends up in the database remains.
hierarchy:
A
|-> 1
|   |-> 1
|   |-> 2
|-> 2
|-> 3
B
|-> 1
|-> 2
table:
 ---------------------------
| id  | element | father |
|---------------------------|
| 000 | A       | null   |
| 001 | 1       | 000    |
| 002 | 1       | 001    |
| 003 | 2       | 001    |
| 004 | 2       | 000    |
| 005 | 3       | 000    |
| 006 | B       | null   |
| 007 | 1       | 006    |
| 008 | 2       | 006    |
 ---------------------------

A very fast hierarchical structure is the nested set (also called a Celko tree). It is a bit like a binary tree or a Huffman tree, and it is worth considering when you are on a MySQL storage engine. The disadvantage is that insertion and deletion are expensive. Other RDBMSs also support recursive queries. In general, I haven't seen many nested sets in practice; they seem complicated to create and maintain. When a nested set is too complicated and the RDBMS doesn't support recursive queries, there is also the materialized path.
http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html
http://en.wikipedia.org/wiki/Binary_tree
http://en.wikipedia.org/wiki/Huffman_coding
http://www.postgresql.org/docs/8.4/static/queries-with.html
Is it possible to make a recursive SQL query?
http://www.cybertec.at/pgbook/node122.html
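As a rough illustration of the recursive-query option, here is a minimal sketch that walks the question's id/element/father table with a recursive common table expression. It uses Python's built-in sqlite3 module as a stand-in for MySQL (MySQL 8.0 also accepts WITH RECURSIVE); the table name element, the column name, and the ids 7 and 8 for B's children are assumptions made for the example. The path column it builds is essentially a materialized path computed on the fly.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE element (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        father INTEGER REFERENCES element (id)  -- NULL for root elements
    );
    INSERT INTO element (id, name, father) VALUES
        (0, 'A', NULL), (1, '1', 0), (2, '1', 1), (3, '2', 1),
        (4, '2', 0),    (5, '3', 0),
        (6, 'B', NULL), (7, '1', 6), (8, '2', 6);
""")


def subtree(conn, root_id):
    """Return (id, path, depth) for root_id and all of its descendants."""
    return conn.execute("""
        WITH RECURSIVE tree (id, path, depth) AS (
            SELECT id, name, 0 FROM element WHERE id = ?
            UNION ALL
            SELECT e.id, tree.path || '/' || e.name, tree.depth + 1
            FROM element e
            JOIN tree ON e.father = tree.id
        )
        SELECT id, path, depth FROM tree ORDER BY path
    """, (root_id,)).fetchall()


for row in subtree(conn, 0):
    print(row)
```

With an index on father, each recursion step is a plain index lookup, so a depth of 5-10 levels means at most that many rounds of lookups per subtree.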

Related

How should I build up my database when I want to store this kind of data?

I want to build a page like the one shown below, with all data retrieved from a database. The terms, subjects and sentences are all retrieved from the database: three levels of data. Under each term (e.g. Spring 2017) I can pick and choose between all of these sentences.
Spring 2017
    Subject1
        Sentence 1
        Sentence 2
        Sentence 3
    Subject2
        Sentence 13
        Sentence 12
        Sentence 17
    Subject3
        Sentence 11
        Sentence 14
        Sentence 19
Autumn 2017
    ...
I want to present similar info from the database to the user and let the user choose between all these sentences. How should I build up my database to achieve this in the best and most efficient way?
One way is:
Table 'subject'
| id | subjects |
| 3  | Subject1 |
| 4  | Subject2 |
Table 'sentences'
| id | subjectid | name        |
| 1  | 3         | Sentence 2  |
| 2  | 4         | Sentence 13 |
Table 'term'
| id | term      | sentenceid |
| 1  | Spring 17 | 1,2,28     |
Another way is maybe using pivot-tables, something like this:
Table 'sentences'
| id | parentid | name |
| 1 | 0 | Subject2 |
| 2 | 3 | Sentence 2 |
| 3 | 0 | Subject1 |
| 4 | 1 | Sentence 13 |
Table 'term'
| id | term | sentenceid |
| 1 | Spring 17 | 2,4,28 |
Notice: Number of terms can be many more than just two in a year.
Do you recommend either of these structures, or is there another way you think I should build my database? Is one of them more efficient, less demanding, or easier to adjust?

You are doing relational analysis/design:
Find all substantives/nouns of your domain. These are candidates for tables.
Find any relationships/associations between those substantives. "Has", "consists of", "belongs to", "depends on" and so on. Divide them into 1:1, 1:n, n:m associations.
Look hard at the 1:1 ones and check whether you can merge two of your original tables into one.
The 1:n ones lead you to foreign keys in one of the tables.
The n:m ones give you additional association tables, possibly with their own attributes.
That's about it. I would strongly advise against optimizing for speed or space at this point. Any modern RDBMS will be totally indifferent to the number of rows you are likely to encounter in your example. All database-related software (ORMs etc.) expects such a clean model. Packing ids into comma-separated fields is an absolute no-no: it defeats all the mechanisms your RDBMS has for dealing with such data, it makes the application harder to program, it confuses GUIs, and so on.
Making weird choices in your table setup so that it deviates from a clean model of your domain is the #1 cause of trouble along the way. You can optimize for performance later, if and when you actually get into trouble. Except for extreme cases (huge data sets or throughput), such optimisation primarily takes place inside the RDBMS (indexes, storage parameters, buffer management etc.) or by optimizing your queries, not by changing the tables.
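To make the n:m step concrete, here is a minimal sketch of how the comma-separated sentenceid column from the question could become an association table. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name term_sentence and the column names are assumptions for illustration, not something the question defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sentence (
        id         INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES subject (id),  -- 1:n
        name       TEXT NOT NULL
    );
    CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- n:m association: one row per (term, sentence) pair, no id lists
    CREATE TABLE term_sentence (
        term_id     INTEGER NOT NULL REFERENCES term (id),
        sentence_id INTEGER NOT NULL REFERENCES sentence (id),
        PRIMARY KEY (term_id, sentence_id)
    );
    INSERT INTO subject  VALUES (3, 'Subject1'), (4, 'Subject2');
    INSERT INTO sentence VALUES (1, 3, 'Sentence 2'), (2, 4, 'Sentence 13');
    INSERT INTO term     VALUES (1, 'Spring 17');
    INSERT INTO term_sentence VALUES (1, 1), (1, 2);
""")


def sentences_for_term(conn, term_name):
    """Return (subject, sentence) pairs chosen for the given term."""
    return conn.execute("""
        SELECT su.name, se.name
        FROM term t
        JOIN term_sentence ts ON ts.term_id = t.id
        JOIN sentence se      ON se.id = ts.sentence_id
        JOIN subject su       ON su.id = se.subject_id
        WHERE t.name = ?
        ORDER BY su.name, se.name
    """, (term_name,)).fetchall()


print(sentences_for_term(conn, "Spring 17"))
```

Unlike the '1,2,28' string, a row pointing at a sentence that does not exist is rejected by the foreign key, and the join needs no string parsing.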

If the data is hierarchical, consider representing it with a single table, with one column referencing a simple lookup for the "entry type".
Table AcademicEntry
==========================================================
| ID | EntryTypeID | ParentAcademicEntryID | Description |
==========================================================
| 1  | 3           | 3                     | Sentence 1  |
| 2  | 1           | <null>                | Spring 2017 |
| 3  | 2           | 2                     | Subject1    |
Table EntryType
====================
| ID | Description |
====================
| 1  | Semester    |
| 2  | Subject     |
| 3  | Sentence    |

Start with the terms. Every term has subjects. Every subject has sentences. Then you may need the position of a subject within a term and probably the position of a sentence in a subject.
Table 'term'
id | term
---+------------
1  | Spring 2017
Table 'subject'
id | title    | termid | pos
---+----------+--------+----
3  | Subject1 | 1      | 1
4  | Subject2 | 1      | 2
5  | Subject3 | 1      | 3
Table 'sentence'
id | name        | subjectid | pos
---+-------------+-----------+----
1  | Sentence 2  | 3         | 2
2  | Sentence 13 | 4         | 1
3  | Sentence 1  | 3         | 1
4  | Sentence 3  | 3         | 3
5  | Sentence 17 | 4         | 3
...
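A minimal sketch of how the pos columns would be used to render the page in order, again with Python's built-in sqlite3 module as a stand-in for MySQL. The outline helper is hypothetical, only part of the sample data is loaded, and Sentence 17 is given id 5 here so that the ids are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (id INTEGER PRIMARY KEY, term TEXT NOT NULL);
    CREATE TABLE subject (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        termid INTEGER NOT NULL REFERENCES term (id),
        pos    INTEGER NOT NULL
    );
    CREATE TABLE sentence (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        subjectid INTEGER NOT NULL REFERENCES subject (id),
        pos       INTEGER NOT NULL
    );
    INSERT INTO term VALUES (1, 'Spring 2017');
    INSERT INTO subject VALUES (3, 'Subject1', 1, 1), (4, 'Subject2', 1, 2);
    INSERT INTO sentence VALUES
        (1, 'Sentence 2', 3, 2), (2, 'Sentence 13', 4, 1),
        (3, 'Sentence 1', 3, 1), (4, 'Sentence 3', 3, 3),
        (5, 'Sentence 17', 4, 3);  -- id 5 is an assumption to keep ids unique
""")


def outline(conn, term_id):
    """Return {subject title: [sentence names]} in display order."""
    rows = conn.execute("""
        SELECT su.title, se.name
        FROM subject su
        JOIN sentence se ON se.subjectid = su.id
        WHERE su.termid = ?
        ORDER BY su.pos, se.pos
    """, (term_id,))
    result = {}
    for title, name in rows:
        result.setdefault(title, []).append(name)
    return result


print(outline(conn, 1))
```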

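This table design should meet your needs.
-- SQL Server syntax (VARCHAR(MAX)); primary keys added so the FKs have a target
CREATE TABLE tblSeason (
    SeasonId   INT PRIMARY KEY,
    SeasonName VARCHAR(30)
);
CREATE TABLE tblSubject (
    SubjectId   INT PRIMARY KEY,
    SeasonId    INT REFERENCES tblSeason (SeasonId),   -- FK to tblSeason
    SubjectData VARCHAR(MAX)
);
CREATE TABLE tblSentences (
    SentencesId  INT PRIMARY KEY,
    SubjectId    INT REFERENCES tblSubject (SubjectId), -- FK to tblSubject
    SentenceData VARCHAR(MAX)
);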

Multiple Data Sources in Microsoft Excel SQL Query

I have a lot of spreadsheets that pull transactional information from our ERP software into Excel using the Microsoft Query that we then perform other calculations on automatically. Recently we upgraded our ERP system, but management made the decision to leave the transactional history in the old databases to have a clean one going forward in the new system. I still need to have some "rolling 12 months" graphs, but if I use only the old database, I'm missing new data and if I use only the new, I'm missing the last 11 months data.
Is there a way that I can write a query in Excel to pull data from the old database PartTran table and merge it with the new database PartTran table without user intervention each time? For instance, I don't want my users (if possible) to have to have two queries that they copy and paste into one Excel table. The schema of the tables (at least the columns I need) are identically named and defined.

If you want to take a bit of a fun, hacky Excel approach, you could do the "copy-paste" bit FOR your users behind the scenes. Given two similar tables OLD and NEW with structures
+-----+------+-------+------------+
| id | foo | bar | date |
+-----+------+-------+------------+
| 95 | blah | $25 | 2015-06-01 |
| 96 | bork | $12 | 2015-07-01 |
| 97 | bump | $200 | 2015-08-01 |
| 98 | fizz | | 2015-09-01 |
| 99 | buzz | $50 | 2015-10-01 |
| 100 | char | ($1) | 2015-11-01 |
| 101 | mope | | 2015-12-01 |
+-----+------+-------+------------+
and
+----+-----+-------+------------+------+---------+
| id | foo | bar | date | fizz | buzz |
+----+-----+-------+------------+------+---------+
| 1 | cat | ($10) | 2016-01-01 | 285B | 1110111 |
| 2 | dog | $25 | 2016-02-01 | 27F5 | 1110100 |
| 3 | ant | $100 | 2016-03-01 | 1F91 | 1001111 |
+----+-----+-------+------------+------+---------+
... you can union together the data for these two datasets with some prudent excel wizardry as below:
Your UNION table ( named using alt+j+t+a ) should have the following items:
New natural ID
DataSet pointer ( name of old or new table )
Derived ID from original dataset
Columns of data you want from Old & New DataSets
example:
+---------+------------+------------+----+------+-----+------------+------+------+
| UnionId | SourceName | SourceRank | id | foo | bar | date | fizz | buzz |
+---------+------------+------------+----+------+-----+------------+------+------+
| 1 | OLD | | | | | | | |
| 2 | NEW | | | | | | | |
+---------+------------+------------+----+------+-----+------------+------+------+
You will then make judicious use of INDIRECT() and VLOOKUP() to derive the lookup id and column targets. Sample code below
SourceRank - helper column
=COUNTIFS([SourceName],[@SourceName],[UnionId],"<="&[@UnionId])
id - the id from the original DataSet
=SMALL(INDIRECT([@SourceName]&"[id]"),[@SourceRank])
Everything else is just VLOOKUP madness!! Although I've taken the liberty of copying the sample code below for reference
foo =VLOOKUP([@id],INDIRECT([@SourceName]),MATCH(UNION[[#Headers],[foo]],INDIRECT([@SourceName]&"[#Headers]"),0),0)
bar =VLOOKUP([@id],INDIRECT([@SourceName]),MATCH(UNION[[#Headers],[bar]],INDIRECT([@SourceName]&"[#Headers]"),0),0)
date =VLOOKUP([@id],INDIRECT([@SourceName]),MATCH(UNION[[#Headers],[date]],INDIRECT([@SourceName]&"[#Headers]"),0),0)
fizz =VLOOKUP([@id],INDIRECT([@SourceName]),MATCH(UNION[[#Headers],[fizz]],INDIRECT([@SourceName]&"[#Headers]"),0),0)
buzz =VLOOKUP([@id],INDIRECT([@SourceName]),MATCH(UNION[[#Headers],[buzz]],INDIRECT([@SourceName]&"[#Headers]"),0),0)
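For readers who find the formulas hard to follow, the same logic can be sketched outside Excel. The Python below is only an illustration of what the helper columns compute, not a replacement for the workbook: the dictionaries stand in for the OLD and NEW tables (shortened sample data), and build_union is a hypothetical helper name.

```python
# Hypothetical stand-ins for the OLD and NEW Excel tables, keyed by id.
OLD = {
    95: {"foo": "blah", "bar": "$25", "date": "2015-06-01"},
    96: {"foo": "bork", "bar": "$12", "date": "2015-07-01"},
}
NEW = {
    1: {"foo": "cat", "bar": "($10)", "date": "2016-01-01",
        "fizz": "285B", "buzz": "1110111"},
}
TABLES = {"OLD": OLD, "NEW": NEW}
COLUMNS = ["foo", "bar", "date", "fizz", "buzz"]


def build_union(source_names):
    """Fill one UNION row per entry in source_names (the SourceName column)."""
    seen = {}  # running count per source, like the COUNTIFS helper column
    rows = []
    for union_id, source in enumerate(source_names, start=1):
        seen[source] = seen.get(source, 0) + 1
        rank = seen[source]
        ids = sorted(TABLES[source])
        # SMALL(ids, rank): the rank-th smallest id, or None for a placeholder row
        row_id = ids[rank - 1] if rank <= len(ids) else None
        row = {"UnionId": union_id, "SourceName": source,
               "SourceRank": rank, "id": row_id}
        for col in COLUMNS:
            # VLOOKUP by id; None where the row or the column does not exist
            row[col] = TABLES[source].get(row_id, {}).get(col)
        rows.append(row)
    return rows


union = build_union(["OLD", "NEW", "OLD", "NEW"])
for row in union:
    print(row)
```

The None values correspond to the cells that would show lookup errors in Excel: columns the OLD table does not have, and placeholder rows waiting for data.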
Output
You'll likely want to make prudent use of If() and/or IfError() to help your users ignore the new column references to the old table and those rows that do not yet have data. Without that, however, you'll end up with something like the below.
This is both ready to accept & read new inputs to both OLD and NEW DataSets and is sortable to get rid of those pesky placeholder rows...
Hope this helps! Happy coding!

MySQL - At what point should more than one table be used?

Edit for future viewers: Aside from the accepted answer which helped me I found some really good info here .
I've got a database with a single table for displaying inventory on a website (RVs). It stores the typical info: year, make, model, etc. I originally made it with 6 extra columns for storing "special features", but I don't like having such a hard limit on what options can be listed. Since I've never messed with more than a single table my gut instinct was to just add 24 or so more columns to cover everything, but something in my head told me that there might be a better way. So when do I decide N columns is too many? The data in these columns will commonly not be unique.
(Sorry for crappy diagram)
Current table design:
-----------------------------------------------------------------------
| id | year | make | model | price | ft_1 | ft_2 | ft_3 | ft_4 | ft_5 |
-----------------------------------------------------------------------
| | | | | | | | | | |
-----------------------------------------------------------------------
Possible better design:
table #1
------------------------------------
| id | year | make | model | price |
------------------------------------
| | | | | |
------------------------------------
table #2
---------------------------------------------
| unique_id(?) | feature | unit_ref |
---------------------------------------------
| 0 | "Diesel Pusher" | 2,6,14 |
---------------------------------------------
I feel like a bonus of the second table might be that I could more easily populate a dropdown containing all the previously entered features to speed up adding new units to inventory.
Is this the right way to go about it, or should I just add more columns and be content?
Thanks.

Believe it or not, your best option would likely be to add a third table.
Since each record in your rvs table can be linked to multiple rows in the features table, and each feature can correspond to multiple rvs, you have a many-to-many relationship which is inherently difficult to maintain in a relational dbms. By adding a third "intersection" table you convert it to a one-to-many-to-one relationship which can be enforced declaratively by the dbms.
Your table structure would then become something like
rvs
------------------------------------
| id | year | make | model | price |
------------------------------------
| | | | | |
------------------------------------
features
--------------------------
| id | feature |
--------------------------
| 1192 | "Diesel Pusher" |
--------------------------
rv_features
----------------------
| rv_id | feature_id |
----------------------
| | |
----------------------
How do you make use of this? Suppose you want to record the fact that the 2016 Travelmore CampMaster has a 25kW diesel generator. You would first add a record to rvs like
--------------------------------------------------
| id | year | make | model | price |
--------------------------------------------------
| 0231 | 2016 | Travelmore | CampMaster | 750000 |
| 2101 | 2016 | Travelmore | Domestant | 650000 |
--------------------------------------------------
(Note the value in the id column is entirely arbitrary; its sole purpose is to serve as the primary key which uniquely identifies the record. It can encode meaningful information, but it must be something that will not change throughout the life of the record it identifies.)
You then add (or already have) the generator in the features table:
--------------------------------
| id | feature |
--------------------------------
| 1192 | Diesel Pusher 450hp |
| 3209 | diesel generator 25kW |
--------------------------------
Finally, you associate the rv to the feature with a record in rv_features:
----------------------
| rv_id | feature_id |
----------------------
| 0231 | 3209 |
| 0231 | 1192 |
| 2101 | 3209 |
----------------------
(I've added a few other records to each table for context.)
Now, to retrieve the features of the 2016 CampMaster, you use the following SQL query:
SELECT r.year, r.make, r.model, f.feature
FROM rvs r
JOIN rv_features rf ON rf.rv_id = r.id
JOIN features f ON f.id = rf.feature_id
WHERE r.id = '0231';
to get
----------------------------------------------------------
| year | make | model | feature |
----------------------------------------------------------
| 2016 | Travelmore | CampMaster | diesel generator 25kW |
| 2016 | Travelmore | CampMaster | Diesel Pusher 450hp |
----------------------------------------------------------
To see the rvs with a 25kW generator, change the query to
SELECT r.year, r.make, r.model, f.feature
FROM rvs r
JOIN rv_features rf ON rf.rv_id = r.id
JOIN features f ON f.id = rf.feature_id
WHERE f.id = '3209';
Sherantha's link to A Quick-Start Tutorial on Relational Database Design actually looks like a good intro to table design and normalization; you might find it useful.
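Here is a minimal end-to-end sketch of the three-table design, using Python's built-in sqlite3 module as a stand-in for MySQL. The column types and the helper names features_of, rvs_with and all_features are assumptions for illustration; the ids are stored as text only so that the leading zero in '0231' survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE rvs (
        id TEXT PRIMARY KEY, year INTEGER, make TEXT, model TEXT, price INTEGER
    );
    CREATE TABLE features (id INTEGER PRIMARY KEY, feature TEXT NOT NULL UNIQUE);
    CREATE TABLE rv_features (
        rv_id      TEXT    NOT NULL REFERENCES rvs (id),
        feature_id INTEGER NOT NULL REFERENCES features (id),
        PRIMARY KEY (rv_id, feature_id)  -- each pairing is stored once
    );
    INSERT INTO rvs VALUES
        ('0231', 2016, 'Travelmore', 'CampMaster', 750000),
        ('2101', 2016, 'Travelmore', 'Domestant', 650000);
    INSERT INTO features VALUES
        (1192, 'Diesel Pusher 450hp'), (3209, 'diesel generator 25kW');
    INSERT INTO rv_features VALUES ('0231', 3209), ('0231', 1192), ('2101', 3209);
""")


def features_of(conn, rv_id):
    """Feature names of one RV."""
    rows = conn.execute("""
        SELECT f.feature
        FROM rv_features rf
        JOIN features f ON f.id = rf.feature_id
        WHERE rf.rv_id = ?
        ORDER BY f.id
    """, (rv_id,))
    return [feature for (feature,) in rows]


def rvs_with(conn, feature_id):
    """Models of all RVs that have the given feature."""
    rows = conn.execute("""
        SELECT r.model
        FROM rv_features rf
        JOIN rvs r ON r.id = rf.rv_id
        WHERE rf.feature_id = ?
        ORDER BY r.id
    """, (feature_id,))
    return [model for (model,) in rows]


def all_features(conn):
    """Every known feature, e.g. to fill a dropdown when adding a unit."""
    return [f for (f,) in conn.execute("SELECT feature FROM features ORDER BY id")]


print(features_of(conn, "0231"))
print(rvs_with(conn, 3209))
```

The all_features query is what would back the dropdown mentioned in the question: adding a new feature is an INSERT into features, not a new column.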

There is a thing called "third normal form". Strictly, it requires every non-key column to depend only on the key; it is often read more loosely as "values that repeat across rows belong in their own table". Taken that far, you would need a table for year, a table for make, a table for model and so on, plus a table that combines all these ids into one connected dataset.
But this is not always practical. I think the best approach is something in between: separate tables for entries that repeat very often, but no extra table with unique ids for something like price, which would be overkill.

Based on your scenario, if you believe the number of feature columns will stay the same, there is no need for a second table. If there is any possibility that features will be added in the future, you should break your table up into two (RVS and Features), then create a third table that links RVS and Features, since this is a many-to-many relationship. So I suggest you use three tables.

I think it is better for you to be more familiar with relational database design. This is a short but great article I have found earlier.

How to save language skill levels correctly in a database

I think I am facing a problem that many of you have faced before. I have a registration form where a user can pick any language on the planet and then pick their skill level for that language from a select box.
So, for example:
Language1: German
Skill: Fluent
Language2: English
Skill: Basic
I'm thinking what's the best way to store these values in a MySQL database.
I thought of two ways.
First way: creating a column for each language and assigning a skill value to it.
--------------------------------------------------
| UserID | language_en | language_ge |
--------------------------------------------------
| 22 | 1 | 4 |
--------------------------------------------------
| 23 | 3 | 4 |
--------------------------------------------------
So the language is always the column's name and the number represents the skill level (1. Basic, 2. Average ...).
I believe this is a nice way to work with these things and it is also pretty fast. The problem starts when there are 50 languages or more. It doesn't sound like a good idea to make 50 columns when the script always has to check them all to see whether a user has any skill in a language.
Second way: inserting an array in one of the table's column. The table will look like this:
----------------------------------
| UserID | languages |
----------------------------------
| 22 | "ge"=>"4", "en"=>"1" |
----------------------------------
This way the user with ID 22 has skill level 4 for German and skill level 1 for English. This is fine because we don't need to check 50 additional columns (or even more), but it's still not the right way in my eyes.
We would have to parse a lot of results to find a user who, for example, has level 1 for German and level 2 for Spanish, regardless of their English skill level. That will take the server longer, and when the data grows we are in trouble.
I bet many of you have experienced this kind of issue. Please, can someone advise me how to sort this out?
Thanks a lot.

I'd advise you to have a separate table with all the languages:
Table: Language
+------------+-------------------+--------------+
| LanguageID | LanguageNameShort | LanguageName |
+------------+-------------------+--------------+
| 1 | en | English |
| 2 | de | German |
+------------+-------------------+--------------+
And another table to link the users to the languages:
Table: LanguageLink
+--------+------------+--------------+
| UserID | LanguageID | SkillLevelID |
+--------+------------+--------------+
| 22 | 1 | 1 |
| 22 | 2 | 4 |
| 23 | 1 | 3 |
| 23 | 2 | 4 |
+--------+------------+--------------+
This is the normalised way to represent that kind of relation in a DB. All data is easily searchable and you don't have to change the DB schema if you add a language.
To render a user's languages you could use a query like this one. It will give you one row per language a user speaks:
SELECT
    LanguageLink.UserID,
    LanguageLink.SkillLevelID,
    Language.LanguageNameShort
FROM LanguageLink
JOIN Language ON Language.LanguageID = LanguageLink.LanguageID
WHERE LanguageLink.UserID = 22;
If you want to go further, you could create another table for the skill level:
Table: Skill
+--------------+-----------+
| SkillLevelID | SkillName |
+--------------+-----------+
| 1 | bad |
| 2 | mediocre |
| 3 | good |
| 4 | perfect |
+--------------+-----------+
What I've done here is called database normalization. I'd recommend reading about it; it may help you design further databases.
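The question also asks how to find users by a combination of skills, for example level 1 in German and level 2 in Spanish regardless of English. With the link table this becomes a grouped query. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; user 24, the Spanish row, and the users_with helper are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Language (
        LanguageID        INTEGER PRIMARY KEY,
        LanguageNameShort TEXT NOT NULL UNIQUE,
        LanguageName      TEXT NOT NULL
    );
    CREATE TABLE LanguageLink (
        UserID       INTEGER NOT NULL,
        LanguageID   INTEGER NOT NULL REFERENCES Language (LanguageID),
        SkillLevelID INTEGER NOT NULL,
        PRIMARY KEY (UserID, LanguageID)  -- one skill level per user and language
    );
    INSERT INTO Language VALUES
        (1, 'en', 'English'), (2, 'de', 'German'), (3, 'es', 'Spanish');
    INSERT INTO LanguageLink VALUES
        (22, 1, 1), (22, 2, 4), (23, 1, 3), (23, 2, 4),
        (24, 2, 1), (24, 3, 2);  -- user 24 is made up for this example
""")


def users_with(conn, wanted):
    """Return ids of users holding every (language code -> level) pair in wanted.

    wanted must be a non-empty dict such as {"de": 1, "es": 2}.
    """
    conditions = " OR ".join(
        "(l.LanguageNameShort = ? AND ll.SkillLevelID = ?)" for _ in wanted
    )
    params = [value for pair in wanted.items() for value in pair]
    rows = conn.execute(f"""
        SELECT ll.UserID
        FROM LanguageLink ll
        JOIN Language l ON l.LanguageID = ll.LanguageID
        WHERE {conditions}
        GROUP BY ll.UserID
        HAVING COUNT(*) = ?  -- the user matched every requested pair
        ORDER BY ll.UserID
    """, params + [len(wanted)])
    return [user_id for (user_id,) in rows]


print(users_with(conn, {"de": 1, "es": 2}))
```

The HAVING count works because the primary key allows at most one row per user and language, so a user matches exactly as many rows as requested pairs they satisfy.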

MySQL Storing different nested sets in same table

I have a table that stores nested sets. It stores different nested sets, differentiated by a collectionid (yes, I'm mixing terms here; it really should be nestedsetid). It looks somewhat like this:
id | orgid | leftedge | rightedge | level | collectionid
1 | 123 | 1 | 6 | 1 | 1
2 | 111 | 2 | 3 | 2 | 1
3 | 23 | 4 | 5 | 2 | 1
4 | 67 | 1 | 2 | 1 | 2
5 | 123 | 3 | 4 | 1 | 2
6 | 600 | 1 | 6 | 1 | 3
7 | 11 | 2 | 5 | 2 | 3
8 | 111 | 3 | 4 | 3 | 3
Originally I wanted to take advantage of the R-Tree indexes, but the code I have seen for this, LineString(Point(-1, leftedge), Point(1, rightedge)), won't quite work, since it doesn't take the collectionid into account, so id:1 and id:6 would end up being the same.
Is there a way I can use the R-Tree index with my current set up... Surely you can have different nested sets in the same table? My main aim is to be able to use the MBRWithin and MBRContains functions. Using MySQL 5.1

For single-dimensional data (these are 1d intervals, right?) there exist better index structures than R-trees. R-trees are designed for dynamic data in 2-10 dimensions (at higher dimensions, performance isn't too good, as the split strategies and distance functions don't work very well anymore).
Actually, for your use case, classic SQL should work very well, and the database can make use of its indexes efficiently. Having a good index structure is one thing, but you also want the database to exploit the indexes it has as well as possible.
As such, I'd just index leftedge and rightedge and use the <, <=, >, >= operators. They are fast! And for the collectionid column, a bitmap index should be good if your database offers one; otherwise an ordinary index on that column will do.
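A minimal sketch of the "classic SQL" approach on the question's data, using Python's built-in sqlite3 module as a stand-in for MySQL. The table name nestedset and the single composite index on (collectionid, leftedge, rightedge) are assumptions for this sketch, swapped in for the separate indexes described above; the point is only that collectionid is part of every comparison, so id 1 and id 6 are never confused.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nestedset (
        id           INTEGER PRIMARY KEY,
        orgid        INTEGER NOT NULL,
        leftedge     INTEGER NOT NULL,
        rightedge    INTEGER NOT NULL,
        level        INTEGER NOT NULL,
        collectionid INTEGER NOT NULL
    );
    -- Composite index (an assumption for this sketch): set first, then edges
    CREATE INDEX idx_set_edges ON nestedset (collectionid, leftedge, rightedge);
    INSERT INTO nestedset VALUES
        (1, 123, 1, 6, 1, 1), (2, 111, 2, 3, 2, 1), (3, 23, 4, 5, 2, 1),
        (4, 67, 1, 2, 1, 2),  (5, 123, 3, 4, 1, 2),
        (6, 600, 1, 6, 1, 3), (7, 11, 2, 5, 2, 3),  (8, 111, 3, 4, 3, 3);
""")


def descendants(conn, node_id):
    """Ids of all nodes strictly inside node_id's interval, same set only."""
    rows = conn.execute("""
        SELECT child.id
        FROM nestedset parent
        JOIN nestedset child
          ON child.collectionid = parent.collectionid
         AND child.leftedge  > parent.leftedge
         AND child.rightedge < parent.rightedge
        WHERE parent.id = ?
        ORDER BY child.leftedge
    """, (node_id,))
    return [child_id for (child_id,) in rows]


def ancestors(conn, node_id):
    """Ids of all nodes whose interval encloses node_id, outermost first."""
    rows = conn.execute("""
        SELECT parent.id
        FROM nestedset child
        JOIN nestedset parent
          ON parent.collectionid = child.collectionid
         AND parent.leftedge  < child.leftedge
         AND parent.rightedge > child.rightedge
        WHERE child.id = ?
        ORDER BY parent.leftedge
    """, (node_id,))
    return [parent_id for (parent_id,) in rows]


print(descendants(conn, 1))
print(ancestors(conn, 8))
```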
