I have a tree encoded in a MySQL database as edges:
CREATE TABLE items (
num INT,
tot INT,
PRIMARY KEY (num)
);
CREATE TABLE tree (
orig INT,
term INT,
FOREIGN KEY (orig) REFERENCES items (num),
FOREIGN KEY (term) REFERENCES items (num)
);
For each leaf in the tree, items.tot is set by someone. For interior nodes, items.tot needs to be the sum of its children. Running the following query repeatedly would generate the desired result.
UPDATE items SET tot = (
SELECT SUM(b.tot) FROM
tree JOIN items AS b
ON tree.term = b.num
WHERE tree.orig=items.num)
WHERE EXISTS
(SELECT * FROM tree WHERE orig=items.num)
(note this actually doesn't work but that's beside the point)
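For what it's worth, the usual workaround for MySQL's refusal to read from the table being updated is to move the aggregate into a derived table, which is materialized before the update runs. A rough sketch:
UPDATE items
JOIN (
    SELECT tree.orig AS num, SUM(b.tot) AS child_sum
    FROM tree JOIN items AS b ON tree.term = b.num
    GROUP BY tree.orig
) AS sums ON sums.num = items.num
SET items.tot = sums.child_sum;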
Assume that the database exists and the invariant is already satisfied.
The question is:
What is the most practical way to update the DB while maintaining this requirement? Updates may move nodes around or alter the value of tot on leaf nodes. It can be assumed that leaf nodes will stay as leaf nodes, interior nodes will stay as interior nodes and the whole thing will remain as a proper tree.
Some thoughts I have had:
Full Invalidation, after any update, recompute everything (Um... No)
Set a trigger on the items table to update the parent of any row that is updated
This would be recursive (updates trigger updates, trigger updates, ...)
Doesn't work, MySQL can't update the table that kicked off the trigger
Set a trigger to schedule an update of the parent of any row that is updated
This would be iterative (get an item from the schedule; processing it schedules more items)
What kicks this off? Trust client code to get it right? (A rough sketch of this scheme follows at the end of this question.)
An advantage is that if the updates are ordered correctly, fewer sums need to be computed. But that ordering is a complication in and of itself.
An ideal solution would generalize to other "aggregating invariants"
FWIW I know this is "a bit overboard", but I'm doing this for fun (Fun: verb, Finding the impossible by doing it. :-)
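For reference, a minimal sketch of the scheduling idea above, assuming a helper table named pending (the name is made up); the worker that drains the queue lives in client code:
CREATE TABLE pending (
    num INT PRIMARY KEY  -- node whose tot needs recomputing
);

DELIMITER //
CREATE TRIGGER items_after_update
AFTER UPDATE ON items
FOR EACH ROW
BEGIN
    -- queue the parent of the changed node; INSERT IGNORE de-duplicates
    INSERT IGNORE INTO pending (num)
    SELECT orig FROM tree WHERE term = NEW.num;
END //
DELIMITER ;

-- Worker (client side): pick a num from pending, recompute its tot with the
-- aggregate query above, delete it from pending, repeat. The recompute fires
-- the trigger again and schedules the next parent, so the work walks up the tree.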
The problem you are having is clear: recursion in SQL. You need to get the parent of the parent... of the leaf and update its total (either subtracting the old value and adding the new one, or recomputing). You need some form of identifier that exposes the structure of the tree, so you can grab all of a node's children and the list of parents (the path from a leaf to the root) to update.
This method adds constant space (two columns to your table -- and you only need one table, else you can do a join later). I played around with a structure a while ago that used a hierarchical format with 'left' and 'right' columns (obviously not those names, since they are reserved words), calculated by a pre-order traversal and a post-order traversal, respectively -- don't worry, these don't need to be recalculated every time.
I'll let you take a look at a page using this method in MySQL instead of continuing this discussion, in case you don't like this method as an answer. But if you like it, post/edit and I'll take some time to clarify; a rough sketch is below.
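As a sketch of what this buys you, assuming the two added columns are called lft and rgt and have already been filled in by the traversals, the whole path from a node up to the root comes back in one SELECT, with no recursion:
SELECT anc.*
FROM items AS anc
JOIN items AS node ON node.num = 42   -- 42: some node of interest (made up)
WHERE anc.lft <= node.lft AND anc.rgt >= node.rgt
ORDER BY anc.lft;                     -- root first, the node itself last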
I am not sure I understand your question correctly, but this could work: My take on trees in SQL.
The linked post describes a method of storing a tree in a database -- PostgreSQL in that case -- but the method is clear enough that it can be adapted easily to any database.
With this method you can easily update all the nodes that depend on a modified node K with about N simple SELECT queries, where N is the distance of K from the root node.
I hope your tree is not really deep :).
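I am not reproducing the linked scheme here, but as a rough, generic illustration of the "one query per level" idea against the schema in your question (the procedure name is made up):
DELIMITER //
CREATE PROCEDURE refresh_ancestors(IN start_node INT)
BEGIN
    DECLARE cur INT;
    DECLARE par INT;
    SET cur = start_node;
    WHILE cur IS NOT NULL DO
        -- one step per level: find the parent of the current node
        SET par = (SELECT orig FROM tree WHERE term = cur LIMIT 1);
        IF par IS NOT NULL THEN
            -- recompute the parent's total from its (already updated) children
            UPDATE items,
                   (SELECT SUM(b.tot) AS s
                    FROM tree JOIN items AS b ON tree.term = b.num
                    WHERE tree.orig = par) AS x
            SET items.tot = x.s
            WHERE items.num = par;
        END IF;
        SET cur = par;
    END WHILE;
END //
DELIMITER ;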
Good Luck!
I have this situation that is as simple as it is annoying.
The requirements are
Every item must have an associated category.
Every item MAY be included in a set.
Sets must be composed of items of the same category.
There may be several sets of the same category.
The desired logic procedure to insert new data is as follows:
Categories are inserted.
Items are inserted. For each new item, a category is assigned.
Sets of items of the same category are created.
I'd like to get a design where data integrity between tables is ensured.
I have come up with the following design, but I can't figure out how to maintain data integrity.
If the relationship highlighted in yellow is not taken into account, everything is very simple and data integrity is enforced by design: an item acquires a category only when it is assigned to a set, and the category is given by the set itself. However, it would then not be possible to have items that are not associated with a set but are linked to a category, and this is annoying.
I want to avoid using special "bridging sets" to assign a category to an item since it would feel hacky and there is no way to distinguish between real sets and special ones.
So I introduced the relationship in yellow. But now you can create sets of objects of different categories!
How can I avoid this integrity problem using only plain constraints (index, uniques, FK) in MySQL?
Also, I would like to avoid triggers; they seem a fragile and not very reliable way to solve this problem...
I've read similar questions like How to preserve data integrity in circular reference database structure? but I cannot understand how to apply the solution to my case...
Interesting scenario. I don't see a slam-dunk 'best' approach. One consideration here is: what proportion of items are in sets vs attached only to categories?
What you don't want is two fields on items, because, as you say, there are going to be data anomalies: an item's direct category differing from the category it inherits via its set.
Ideally you'd make a single field on items that is an Algebraic Data Type aka Tagged Union, with a tag saying its payload was a category vs a set. But SQL doesn't support ADTs. So any SQL approach would have to be a bit hacky.
Then I suggest the compromise is to make every item a member of a set, from which it inherits its category. Then data access is consistent: always JOIN items-sets-categories.
To support that, create dummy sets whose only purpose is to link to a category.
To address "there is no way to distinguish between real sets and special ones": put an extra field/indicator on sets: this is a 'real' set vs this is a link-to-category set. (Or a hack: make the set-description as "Category: <category-name>".)
Addit: BTW your "desired logic procedure to insert new data" is just wrong: you must insert sets (Step 3) before items (Step 2).
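A rough sketch of that compromise, with made-up table and column names; the flag on sets is what separates the real sets from the link-to-category ones:
CREATE TABLE categories (
    id INT PRIMARY KEY
);

CREATE TABLE sets (
    id INT PRIMARY KEY,
    category_id INT NOT NULL,
    is_category_link BOOLEAN NOT NULL DEFAULT FALSE,  -- TRUE only for the dummy sets
    FOREIGN KEY (category_id) REFERENCES categories (id)
);

CREATE TABLE items (
    id INT PRIMARY KEY,
    set_id INT NOT NULL,
    FOREIGN KEY (set_id) REFERENCES sets (id)
);

-- Data access is then always the same join:
-- SELECT ... FROM items
-- JOIN sets ON sets.id = items.set_id
-- JOIN categories ON categories.id = sets.category_id;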
I think I might have found a solution by looking at the answer from Roger Wolf to a similar situation here:
Ensuring relationship integrity in a database modelling sets and subsets
Essentially, in the items table, I've changed the set_id FK into a composite FK that references both sets.id and sets.category_id from, respectively, the items.set_id and items.category_id columns.
In this way there is an overlap of the two FKs on items table.
So for each row in items table, once a category_id is chosen, the FK referring to the sets table is forced to point to a set of the same category.
If this condition is not respected, an exception is thrown.
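A stripped-down sketch of the arrangement (table and column names here are generic, not my exact schema); the UNIQUE key on sets gives the composite FK something to reference:
CREATE TABLE categories (
    id INT PRIMARY KEY
);

CREATE TABLE sets (
    id INT PRIMARY KEY,
    category_id INT NOT NULL,
    FOREIGN KEY (category_id) REFERENCES categories (id),
    UNIQUE KEY uq_set_category (id, category_id)   -- referenced by items below
);

CREATE TABLE items (
    id INT PRIMARY KEY,
    category_id INT NOT NULL,
    set_id INT NULL,   -- NULL means "not in any set"
    FOREIGN KEY (category_id) REFERENCES categories (id),
    -- the overlap: (set_id, category_id) must match an existing set,
    -- so an item can only be placed in a set of its own category
    FOREIGN KEY (set_id, category_id) REFERENCES sets (id, category_id)
);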
Now, the original answer came with an advice against the use of this approach.
I am uncertain whether this is a good idea or not.
It certainly works, and I think it is a fairly elegant solution compared to one that uses triggers for such a simple piece of a more complex design.
Maybe the same solution becomes more difficult to understand and maintain if applied heavily across a large set of tables.
Edit:
As AntC pointed out in the comments below, this technique, although it works, can cause insidious problems, e.g. if you want to change the category_id of a set.
In that case you would have to update the category_id of each item linked to that set.
That needs a transaction (BEGIN ... COMMIT) wrapped around the updates.
So ultimately it's probably not worth it and it's better to investigate the requirements further in order to find a better schema.
Structure in MySQL (for compactness I am using a simplified notation)
Notation: table name->[column1(key or index), column2, …]
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(index), title, description]
Each document can contain a large number of elements (between 1 and 100k+)
We have two key requirements:
Load all elements for a given doc_id quickly
Update the value of one individual element by its element_id quickly
Structure in Cassandra
1st solution
documents->[doc_id(primary key), title, description, elements] (elements could be a SET or a TEXT; each time new elements are added (they are never removed) we would append them to this column)
elements->[element_id(primary key), title, description]
To load a document we would need:
Load the document with the given id and get all its element ids: SELECT * from documents where doc_id='id'
Load all elements with those ids: SELECT * FROM elements where element_id IN (ids loaded from the previous query)
Updating elements would be done by their primary key.
2nd solution
documents->[doc_id(primary key), title, description]
elements->[element_id(primary key), doc_id(secondary index), title, description]
To load a document we would need:
SELECT * from elements where doc_id='id'
Updating elements would be done by their primary key.
Questions regarding our solutions:
1st: Will it be efficient to query 100k+ primary keys in the elements table?
SELECT * FROM elements WHERE element_id IN (element_id1,.... element_id100K+)?
2nd: Will it be efficient to query just by a secondary index?
Could anyone give any advice how would we create a model for our use case?
With Cassandra it's all about the access pattern. (I hope I understood your question correctly; if not, please comment.)
1st
documents should not use a set, because a set is limited to 65,535 elements and has to be read and updated in its entirety every time a change is made. Since you need 100k+ elements, that's not what you want. You could use frozen collections etc., but then again, reading everything into memory every time is bound to be slow.
2nd
secondary indexes, well, might be fine for small-cardinality data. From what I understand you have 100k+ elements per document; this might even be fine, but it's not best practice. I would simply try it out in your concrete case.
3rd - the "disk is cheap" approach - always write the data the way you are going to read it. Cassandra's writes are dirt cheap, so prepare the views at write time.
This one satisfies reading all the elements belonging to a doc_id:
documents->[doc_id(primary key), title_doc (static), description_doc(static), element_id(clustering key), title, description]
elements remain pretty much as they were:
elements->[element_id(primary key), doc_id, title, description]
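In CQL, that layout might look roughly like this; the static columns are stored once per partition rather than once per element (names follow the notation above, not a definitive schema):
CREATE TABLE documents (
    doc_id uuid,
    title_doc text STATIC,
    description_doc text STATIC,
    element_id uuid,
    title text,
    description text,
    PRIMARY KEY (doc_id, element_id)
);

CREATE TABLE elements (
    element_id uuid PRIMARY KEY,
    doc_id uuid,
    title text,
    description text
);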
When doing updates, you update both documents and elements (for consistency you can use a batch operation, should you need it). If you only have the element_id, you can quickly issue another query once you get its doc_id.
Depending on your updating needs, doc_id could also be a set. (I might not have gotten this part right, because I'm not sure what data is available when updating an element: do you also have the doc_id, and can one element belong to more than one doc?)
Also, since having 100k+ elements in a single partition is not the best thing (all requests for that partition go to one node), I would propose a composite partition key with a bucket; in your case a simple fixed int would be just fine. Every time you retrieve the elements you just issue selects for the doc_id plus each bucket value (1, 2, 3, 4, ...) and merge the results at the client - this will be significantly faster.
One tricky part is that you don't want to visit every single bucket for an element_id stored in the document... thinking about it, it would be better to use a power of two for the number of buckets. In your case 16 would be ideal... then when you need to update a specific element, just use some simple hash function known to you and take the last 4 bits.
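A sketch of the bucketed variant of the same documents table, with 16 buckets as suggested; the bucket value is computed client-side from the element_id:
CREATE TABLE documents (
    doc_id uuid,
    bucket int,
    title_doc text STATIC,
    description_doc text STATIC,
    element_id uuid,
    title text,
    description text,
    PRIMARY KEY ((doc_id, bucket), element_id)
);

-- Read side: fan out over the buckets and merge client-side, e.g.
--   SELECT * FROM documents WHERE doc_id = ? AND bucket IN (0, 1, ..., 15);
-- Write side: bucket = hash(element_id) & 15, so a given element always
-- lands in exactly one known bucket.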
Now that I think about it, if the element_id + doc_id is always known to you, you might not even need the elements table at all.
Hope this helps
Based on the suggestion of Marko, our solution is:
CREATE TABLE documents (
doc_id uuid,
description text,
title text,
PRIMARY KEY (doc_id)
);
CREATE TABLE elements (
doc_id uuid,
element_id uuid,
title text,
PRIMARY KEY (doc_id, element_id)
);
We can retrieve all elements with the following query:
SELECT * FROM elements WHERE doc_id='id'
And update the elements:
UPDATE elements SET title='Hello' WHERE doc_id='id' AND element_id='id';
I have a star schema that tracks Roles in a company, e.g. what dept the role is under, the employee assigned to the role, when they started, when/if they finished up and left.
I have two time dimensions, StartedDate & EndDate. While a role is active, the end date is null in the source system. In the star schema I set any null end dates to 31/12/2099, which is a dimension member I added manually.
I'm working out the best way to update the EndDate for when a role finishes or an employee leaves.
Right now I'm:
Populating the fact table as normal, doing lookups on all dimensions.
I then do a lookup against the fact table to find duplicates, but not including the EndDate in this lookup. Non-matched rows are new and so are inserted into the fact table.
Matching rows then go into a conditional split to check if the current EndDate is different from the new EndDate. If different, they are inserted into an updateStaging table and a proc is run to update the fact table (a sketch of that proc is below).
Is there a more efficient or tidier way to do this?
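For reference, the proc in step 3 boils down to a single set-based UPDATE against the staging table; a rough sketch, where FactRole and the key column names stand in for the real ones:
UPDATE f
SET    f.EndDateKey = s.EndDateKey
FROM   FactRole AS f
JOIN   updateStaging AS s
       ON  s.RoleKey      = f.RoleKey
       AND s.EmployeeKey  = f.EmployeeKey
       AND s.StartDateKey = f.StartDateKey
WHERE  f.EndDateKey <> s.EndDateKey;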
How about putting all that in a Foreach Loop container? It would iterate through and be much more efficient.
I think it is a reasonable solution. I personally would use a stored proc instead for processing efficiency, but given the dimensional nature of your DWH and the implied Type 2 behaviour, this is a valid way to do it.
The other way is to do the "no match" leg of the SSIS package as is, but in the "match" leg you could insert the row into the actual fact table, then have a post-process T-SQL step which updates the two records needed.
I have the following tables
Parent Table

ds_id (pk) | state
-----------+---------
1          | valid
2          | invalid

Child Table

d_id (pk) | ds_id (fk) | approve
----------+------------+---------
1         | 1          | false
2         | 1          | true
3         | 2          | false
4         | 2          | false
The state column in the parent table should change to valid if at least one of its children in the child table has its approve column set to true.
I want to find the simplest most efficient method for calculating and setting the state column based on its children.
I'm using SQL Server 2008.
The change of state would need to be instant.
It is expected that the system would have a few thousand parents, each with around 5 children.
It is more likely that the children would be updated
You haven't provided enough information to decide on what the "best" method is. The two basic ideas are updating the data in the parent when a child value changes or summarizing from the children at query time.
Here are some questions to decide between these approaches:
What is the ratio of reads at the parent level to changes on the child level? If the parent value would be read once for every thousand times that the child values change, then it is probably more efficient to do it dynamically (at query time). If the parent value is read one thousand times for each time a child value changes, then it is probably more efficient to do it statically (with an update).
What are the expected response times for changing a child and reading from a parent?
How much data are we talking about? If the child data is measured in hundreds of rows, it is probably not worth the effort to make the query more efficient.
And, the proposed data is a little awkward for automatic updating. If a child changes from approved = true to false, then what happens? You have to read all the other children in order to set the value in the parent. The alternative is to keep a count of the approved children, and then do logic on that value. One way to do that would be with a computed column in the parent table:
create table . . .
state as (case when ApprovedCount > 0 then 'valid' else 'invalid' end)
As a general observation for the automatic updates, I think triggers are relatively hard to maintain. Using triggers does not seem "simple". Instead, I would have a stored procedure for updating the child table and use the stored procedure for the update and any additional logic.
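A rough sketch of that stored-procedure route, using the parent/child names from the question (the procedure name and exact types are assumptions):
CREATE PROCEDURE dbo.SetChildApproval
    @d_id    INT,
    @approve BIT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    UPDATE child SET approve = @approve WHERE d_id = @d_id;

    -- refresh the parent's state from all of its children
    UPDATE p
    SET    p.state = CASE WHEN EXISTS (SELECT 1 FROM child c
                                       WHERE c.ds_id = p.ds_id AND c.approve = 1)
                          THEN 'valid' ELSE 'invalid' END
    FROM   parent p
    WHERE  p.ds_id = (SELECT ds_id FROM child WHERE d_id = @d_id);

    COMMIT TRANSACTION;
END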
[Responding to the comments]
The data being suggested is quite small. In all probability, the children and the parents will each fit on one data page. Unless you are going for transaction processing benchmarks, you can do the computation on the fly when you query the parents:
select p.id,
       (case when count(c.parentid) > 0 then 'valid' else 'invalid' end) as validness
from parent p left outer join
     child c
     on c.parentid = p.id and c.accepted = 1   -- count only approved children
group by p.id
This will go really fast. And, if you are looking for only one parent at a time, it will be really, really fast. Of course, it would be a small amount faster if you pre-calculated the value. However, the increase in speed is highly unlikely to be important, relative to the complexity of maintaining the value.
CREATE TABLE record (
id INT PRIMARY KEY,
parent_id INT,
count INT NOT NULL
)
I have a table defined as above. The field 'parent_id' refers to the parent of the row, so the whole data set forms an n-ary tree.
According to the business logic I have, when the field 'count' of a row is requested to increment (by one, for example), all of the ancestor nodes (or rows) should be updated to increment the 'count' field by one as well.
Since this 'count' field is expected to be updated frequently (say 1000/sec), I believe that this recursive update would slow down the entire system a lot due to a huge cascading write operation in the DBMS.
For now, I think a stored procedure is the best option I can choose. If MySQL supported something like Oracle's CONNECT BY, there might be a tricky way to do it, but it doesn't, obviously.
Is there any efficient way to implement this?
Thanks in advance.
When you use stored procedures, you will still need recursion. You only move the recursion from the source code to the database.
You can use nested sets to store hierarchical data. Basically, you create two additional fields left and right where left < right. Then a node e1 is subordinate of node e2 iff e1.left > e2.left && e1.right < e2.right.
This gets rid of recursion at the price of higher costs for insertion, deletion and relocation of nodes. On the other hand, updates of node content like the one you described can be done in a single query. This is efficient, because an index can be used to reach a node and all of its ancestors in a single query.
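As a minimal sketch (lft/rgt are used as column names since LEFT and RIGHT are reserved words; populating and maintaining them is the separate, more expensive part mentioned above), the increment of a node and all of its ancestors becomes one statement:
ALTER TABLE record ADD COLUMN lft INT, ADD COLUMN rgt INT;

UPDATE record AS anc
JOIN   record AS node ON node.id = 42   -- 42: the row being incremented (made up)
SET    anc.count = anc.count + 1
WHERE  anc.lft <= node.lft AND anc.rgt >= node.rgt;  -- inclusive, so the node itself is counted too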