CREATE TABLE record (
    id INT PRIMARY KEY,
    parent_id INT,
    count INT NOT NULL
);
I have a table defined as above. The field 'parent_id' refers to the parent of the row, so the whole data set forms an n-ary tree.
According to my business logic, when the 'count' field of a row is incremented (by one, for example), the 'count' field of every ancestor node (row) must be incremented by one as well.
Since this 'count' field is expected to be updated frequently (say 1000 times per second), I believe that this recursive update would slow down the entire system considerably because of the large cascading writes in the DBMS.
For now, I think a stored procedure is the best option available. If MySQL supported something like Oracle's CONNECT BY there might be a trick, but it doesn't.
Is there any efficient way to implement this?
Thanks in advance.
When you use stored procedures, you will still need recursion. You only move the recursion from the source code to the database.
You can use nested sets to store hierarchical data. Basically, you add two fields, left and right, where left < right. A node e1 is then a descendant of node e2 iff e1.left > e2.left && e1.right < e2.right.
This gets rid of recursion at the price of higher costs for insertion, deletion and relocation of nodes. On the other hand, updates of node content like the one you described can be done in a single query. This is efficient, because an index can be used to retrieve a node and all of its ancestors in a single query.
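As a minimal sketch of the nested-set update, the following uses Python's sqlite3 module as a stand-in for MySQL. The column names lft and rgt (for the left and right fields) and the sample tree are assumptions made for illustration; the UPDATE statement itself should carry over to MySQL largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE record (
        id    INTEGER PRIMARY KEY,
        lft   INTEGER NOT NULL,   -- hypothetical name for 'left'
        rgt   INTEGER NOT NULL,   -- hypothetical name for 'right'
        count INTEGER NOT NULL DEFAULT 0
    )
""")
# Sample tree: 1 is the root; 2 and 3 are children of 1; 4 is a child of 2.
conn.executemany(
    "INSERT INTO record (id, lft, rgt) VALUES (?, ?, ?)",
    [(1, 1, 8), (2, 2, 5), (4, 3, 4), (3, 6, 7)],
)

def increment_with_ancestors(conn, node_id, delta=1):
    """Add delta to the node and to every ancestor with one UPDATE."""
    lft, rgt = conn.execute(
        "SELECT lft, rgt FROM record WHERE id = ?", (node_id,)
    ).fetchone()
    # The node and its ancestors are exactly the rows whose interval
    # encloses the node's own interval.
    conn.execute(
        "UPDATE record SET count = count + ? WHERE lft <= ? AND rgt >= ?",
        (delta, lft, rgt),
    )

increment_with_ancestors(conn, 4)
counts = dict(conn.execute("SELECT id, count FROM record"))
print(counts)
```

Incrementing node 4 touches nodes 4, 2 and 1 and leaves node 3 alone, without any recursion in the application or in a stored procedure.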
I have 1.7 million records in an Access table sorted A to Z. The records are not unique; some values are repeated. I want to make them unique by numbering the repeats: if a record is repeated 4 times, I want the first one to get "-1" appended to its value, the second one "-2", and so on, so that similar records become unique. All similar records are beside each other because of the sorting. In Excel I do this with an IF function (if this cell's value <> the value of the cell above, then 1, else the repeat number above plus 1), but in Access I don't know what to do (I'm a beginner).
Finally, I want to add a column to the original table containing (original record value - repeat number).
I appreciate your help.
Note about sort order:
Sort order in a relational database is not concrete like in a spreadsheet. There is no concept of rows being "next to each other", unless in context of an index. An index is largely a tool for the database to handle the data more efficiently (and to aid in defining uniqueness). The order itself is still largely dynamic because the order of a particular query can be specified differently from the index (or from storage order) and this does not change how the data is actually stored. Being "next to each other" is essentially a useless concept in SQL queries, unless you mean "next to each other numerically", for instance with an AutoNumber field or with the "repeat numbers" you want to add. Unlike in a spreadsheet, you cannot refer to the row "just above this row" or the "row offset by 2 from the 'current' row".
Solution
Regardless of whether or not you will use the AutoNumber column later, add a Long Integer AutoNumber column anyway. This column is named [ID] in the example code. Why? Because until you add something that lets the database differentiate between the rows, there is technically no way in standard SQL to reliably reference individual duplicates. Even though you say that there are other differentiating columns, your own description rules out using them as a reliable key for referring to specific rows. (Even without such a differentiating column, Access can technically distinguish between rows. Iterating through a DAO.Recordset object in VBA would work, but it would perhaps not be very elegant or efficient.)
Also add a new integer column for counting repeats, which below is named [DupeIndex]. A separate field is preferred (necessary?) because it allows continued reference to the original, unaltered duplicate values. If the reference number were directly updated, it would no longer match other fields and so would not be easily detected as a duplicate anymore. The following solution relies on grouping of ALL duplicate values, even those already "marked" with a [DupeIndex] number.
You should also realize that, when comparing different data sets, having separate fields allows more flexibility in matching the data. Appending the values to the reference number complicates comparison, since you will likely want to compare not only rows with the same duplication index but all possible combinations. For example, comparing record 123-1 in one set to 123-4 in another... how do you select such rows in an automated fashion? You don't want to have to manually code all combinations, but that's what you'll end up doing if you don't keep them separate like {123,1} and {123,4}.
Create and save this as a named query [Duplicates]. This query is referenced by later queries. It could instead be embedded as a subquery, but my preference is to use saved queries for easier visualization and debugging in Access:
SELECT Data.RefNo, Count(Data.ID) AS Dupes, Max(Data.DupeIndex) AS IndexMax
FROM Data
GROUP BY Data.RefNo
HAVING Count(Data.ID) > 1
Execute the following to create a temporary table with new duplicate index values:
SELECT D1.ID, D1.RefNo,
IIf([Duplicates].[IndexMax] Is Null,0,[Duplicates].[IndexMax])
+ 1
+ (SELECT Count(D2.ID) FROM Data As D2
WHERE D2.[RefNo]=[D1].[RefNo]
And [D2].[DupeIndex] Is Null
And [D2].[ID]<[D1].[ID]) AS NewIndex
INTO TempIndices
FROM Data AS D1 INNER JOIN Duplicates ON D1.RefNo = Duplicates.RefNo
WHERE (D1.DupeIndex Is Null);
Execute the update query to set the new duplicate index values:
UPDATE Data
INNER JOIN TempIndices ON Data.ID = TempIndices.ID
SET Data.DupeIndex = [NewIndex]
Optionally remove the AutoNumber field and now assign the combined [RefNo] and new [DupeIndex] as primary key. The temporary table can also be deleted.
Comments about the queries:
The solution assumes that [DupeIndex] is Null for unprocessed duplicates.
The solution correctly handles existing duplicate index numbers, only updating duplicate rows that do not yet have a unique index.
Access has rather strict conditions for UPDATE queries, namely that updates are not based on circular references and/or that joins will not produce multiple updates for the same row, etc. The temporary table is necessary in this case, since the query determining new index values refers multiple times in subqueries to the very column that is being updated. (If the update is attempted using joins on the subqueries, for example, Access complains that "Operation must use an updatable query".)
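The numbering logic of the queries above can be tried outside Access. The sketch below is an assumption-laden simplification: it runs on SQLite through Python's sqlite3 module rather than on the Access engine, it starts from a table where no row has a [DupeIndex] yet (so the IndexMax offset is left out), and the sample RefNo values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "RefNo TEXT NOT NULL, DupeIndex INTEGER)"
)
conn.executemany(
    "INSERT INTO Data (RefNo) VALUES (?)",
    [("A100",), ("A100",), ("A100",), ("B200",), ("C300",), ("C300",)],
)

# Number each unprocessed row within its RefNo group: 1 plus the count of
# unprocessed rows in the same group that have a smaller ID. Groups with
# a single row are skipped, mirroring the HAVING Count(...) > 1 filter.
conn.execute("""
    CREATE TEMP TABLE TempIndices AS
    SELECT D1.ID,
           1 + (SELECT COUNT(*) FROM Data AS D2
                WHERE D2.RefNo = D1.RefNo
                  AND D2.DupeIndex IS NULL
                  AND D2.ID < D1.ID) AS NewIndex
    FROM Data AS D1
    WHERE D1.DupeIndex IS NULL
      AND (SELECT COUNT(*) FROM Data AS D3 WHERE D3.RefNo = D1.RefNo) > 1
""")

# Copy the computed indices back, as the UPDATE ... INNER JOIN does in Access.
conn.execute("""
    UPDATE Data
    SET DupeIndex = (SELECT NewIndex FROM TempIndices
                     WHERE TempIndices.ID = Data.ID)
    WHERE ID IN (SELECT ID FROM TempIndices)
""")
rows = conn.execute("SELECT RefNo, DupeIndex FROM Data ORDER BY ID").fetchall()
print(rows)
```

The three A100 rows receive 1, 2 and 3 in ID order, the two C300 rows receive 1 and 2, and the lone B200 row keeps a Null index.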
I have a star schema that tracks Roles in a company, e.g. what dept the role is under, the employee assigned to the role, when they started, when/if they finished up and left.
I have two time dimensions, StartedDate & EndDate. While a role is active, the end date is null in the source system. In the star schema I set any null end dates to 31/12/2099, which is a dimension member I added manually.
I'm working out the best way to update the EndDate when a role finishes or an employee leaves.
Right now I'm:
Populating the fact table as normal, doing lookups on all dimensions.
I then do a lookup against the fact table to find duplicates, without including the EndDate in this lookup. Non-matched rows are new and so are inserted into the fact table.
Matching rows then go into a conditional split to check whether the current EndDate is different from the new EndDate. If different, they are inserted into an updateStaging table and a proc is run to update the fact table.
Is there a more efficient or tidier way to do this?
How about putting all that in a foreach container? It would iterate through and be much more efficient.
I think it is a reasonable solution. I personally would use a Stored Proc instead for processing efficiency, but with your dimensional nature of the DWH and implied type 2 nature, this is a valid way to do it.
The other way is to do the "no match" leg of your SSIS package as is, but in the "match" leg insert the row into the actual fact table, then have a post-process T-SQL step which updates the two records needed.
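The staging-table approach from the question can be sketched as one set-based UPDATE. Everything named here is hypothetical (FactRole, UpdateStaging, the key columns, and the integer date keys), and SQLite via Python's sqlite3 stands in for SQL Server, so treat it as an illustration of the idea rather than the actual package logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FactRole (
        role_id INTEGER, employee_id INTEGER,
        start_date_key INTEGER, end_date_key INTEGER,
        PRIMARY KEY (role_id, employee_id, start_date_key)
    );
    CREATE TABLE UpdateStaging (
        role_id INTEGER, employee_id INTEGER,
        start_date_key INTEGER, end_date_key INTEGER
    );
    -- 20991231 plays the part of the manually added 31/12/2099 member.
    INSERT INTO FactRole VALUES (1, 10, 20120101, 20991231);
    INSERT INTO FactRole VALUES (2, 11, 20120201, 20991231);
    -- Role 1 has now ended in the source system.
    INSERT INTO UpdateStaging VALUES (1, 10, 20120101, 20120630);
""")

# One statement applies every staged end date whose value actually differs.
conn.execute("""
    UPDATE FactRole
    SET end_date_key = (
        SELECT s.end_date_key FROM UpdateStaging AS s
        WHERE s.role_id = FactRole.role_id
          AND s.employee_id = FactRole.employee_id
          AND s.start_date_key = FactRole.start_date_key)
    WHERE EXISTS (
        SELECT 1 FROM UpdateStaging AS s
        WHERE s.role_id = FactRole.role_id
          AND s.employee_id = FactRole.employee_id
          AND s.start_date_key = FactRole.start_date_key
          AND s.end_date_key <> FactRole.end_date_key)
""")
fact_rows = conn.execute(
    "SELECT role_id, end_date_key FROM FactRole ORDER BY role_id"
).fetchall()
print(fact_rows)
```

Only role 1 changes; role 2 keeps its open-ended 2099 key because nothing was staged for it.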
I am new to MySQL partitioning, therefore any example will be appreciated.
I am trying to create a sort of ageing mechanism for data that is distributed between several MyISAM tables.
My question will actually include several sub-questions.
The relevant tables are:
First table contains raw data with high input frequency (next to each record there is an auto incremented id).
Second table contains processed results, there is a result record per every raw data record (result record contains the source id record of the auto incremented field of raw data record)
Questions:
I need to be able to partition the raw data table and the result data table similarly, so that each partition holds only 10 weeks of data (each raw data record contains a Unix timestamp field). How do I do it? Can someone write a small example for two such tables?
I want to be able to change the 10-week constraint on the fly.
I want the previous (10 weeks older) partition to be deleted automatically whenever the current partition is filled or a new partition is created.
I don't want the auto increment integer id to overflow. As far as I understand, the ids are unique within a partition only, so if I am not wrong the auto increment id will start from zero for the next partition? But if the previous partition still exists, will I have two duplicated ids? How do I know to reference only the latest id when I present a result record?
I want to load raw data using LOAD DATA INTO... instead of multiple inserts. Is MySQL partitioning functionality affected?
And the last question: would you suggest some other approach to implement the ageing mechanism? (I am writing a Java product that processes around 1 GB of raw data per day and stores the results in MySQL.)
It's hard to give a real answer on this question since it depends on your data. But let me give you some things to think about.
I assume we're talking about some kind of logs with recent data (so not spanning multiple years). You can partition by range. You could add one field to your table with the year/week number (i.e. 201201, 201202, etc.). If this question is related to your question about importing into multiple tables, you can easily do this in that import script.
"On the fly" as in repartitioning your data on the fly (70 GB?): I would not recommend it. But you could do it if you had the week number in there. If you later want to change it to 12 days, you could add a column for the date and partition by that.
Well, it won't be deleted automatically, but a cron job can handle that, right? Just check how many partitions there are, and if there are 3(?), drop the first one.
In a partitioned table, the primary key (like every unique key) must include the column you partition by, so the auto increment column cannot be the whole primary key on its own. Therefore you can never fully rely on the auto increment id alone. I don't see a way around this.
I'm not sure what you mean.
If your data is just some logs in chronological order, then you might just use separate tables for each period. Before you start a new period (at 00:00), check the last id of the last table, create a new table and set its auto increment to that value + 1. Your import then decides when a new period begins, so the period length can be changed easily. Your import script can use a small table in which it stores the next period.
LOAD DATA is really quite fast. I would just have two steps (in no particular order): LOAD DATA and then 'delete .. where date < 10 weeks'. Auto increment will keep going for as long as the data type you're using allows. If you wanted to be super careful you could push it back to zero periodically.
Once the data is in the 'raw' table, run your routine to create the 'processed' table. We use a very similar process where I work. We keep a separate table that has 'write' and 'parse' pointers to all of our 'raw' tables. As new data comes in and gets parsed, the appropriate row pointers get set. If the 'raw' table gets truncated you can reset the 'write' pointer but leave the 'parse' pointer. (We store the offset in another table when this happens, just to be sure.)
I would also recommend creating an index on each of the related columns. This can improve the performance of deleting old data from multiple related tables, since the comparison is then made on indexed numbers rather than on strings.
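The load-then-delete ageing step can be sketched as below. This is an illustration under assumptions: SQLite through Python's sqlite3 replaces MySQL, and the table and column names (raw, result, ts, raw_id) are made up. The retention period is an ordinary parameter, so it can be changed at any time without repartitioning.

```python
import sqlite3

TEN_WEEKS_SECONDS = 10 * 7 * 24 * 60 * 60

def purge_old_rows(conn, now_ts, max_age_seconds=TEN_WEEKS_SECONDS):
    """Delete raw rows older than the cutoff, together with their results."""
    cutoff = now_ts - max_age_seconds
    # Remove dependent result rows first, then the raw rows themselves.
    conn.execute(
        "DELETE FROM result WHERE raw_id IN "
        "(SELECT id FROM raw WHERE ts < ?)", (cutoff,))
    return conn.execute("DELETE FROM raw WHERE ts < ?", (cutoff,)).rowcount

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw (id INTEGER PRIMARY KEY AUTOINCREMENT, ts INTEGER NOT NULL);
    CREATE INDEX raw_ts ON raw (ts);   -- lets the purge find old rows quickly
    CREATE TABLE result (raw_id INTEGER NOT NULL, value TEXT);
""")
now = 1_700_000_000  # arbitrary "current" Unix timestamp for the demo
week = 7 * 86400
conn.executemany("INSERT INTO raw (ts) VALUES (?)",
                 [(now - 11 * week,), (now - 9 * week,), (now,)])
conn.executemany("INSERT INTO result (raw_id, value) VALUES (?, ?)",
                 [(1, "old"), (2, "recent"), (3, "new")])

removed = purge_old_rows(conn, now)
remaining_raw = [r[0] for r in conn.execute("SELECT id FROM raw ORDER BY id")]
remaining_results = [r[0] for r in conn.execute(
    "SELECT raw_id FROM result ORDER BY raw_id")]
print(removed, remaining_raw, remaining_results)
```

Only the 11-week-old row and its result are removed. On large tables, dropping a whole partition is typically much cheaper than a row-by-row DELETE, which is the trade-off between this approach and the partitioned one.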
I wonder if your tables are being sorted or not.
I have a table A (SQL Server 2008) that contains sets of min and max numbers. In one of my stored procedures I use this table joined with a product table B to find an available product number that's between min and max and then insert a new product with this product number. So I want the next free number that's between min/max.
Between finding the next available number and inserting the product row, I want a lock that prevents anyone else from finding the same number (and creating a duplicate).
How should I approach this? Should I take an update lock on table A even though I never modify it? Should the lock be released after I do the insert into table B and the transaction finishes? Will this update lock prevent other transactions from reading table A?
Edit: The min/max table holds the ranges for different product series. Depending on which series you want, I want to find an available number in that range. The productnr is not unique, though it would be possible to make it unique in combination with a second column. Simplified sp:
CREATE PROCEDURE [dbo].[InsertProduct]
(
    @Param1 int,
    @Param2 bit,
    ...
    @Param20 int) AS
BEGIN
    DECLARE @ProductNr int
    --Here I do a query to determine which ProductNr I should have. Checking that the number is between max/min in the series and that no other product has this productnr.
    --Now insert the row
    INSERT INTO Products VALUES (@Param1, @ProductNr, ...., @Param2)
END
Your question is kind of obscure, it would help if you included some sample data.
Regardless, a tactic that I have used in the past is to try to wrap everything within a single statement; here, it would be an INSERT. Every SQL statement is de facto wrapped in its own implicit transaction (that's atomicity, the "A" in the ACID properties of relational databases). In pseudo-code, it'd look something like:
INSERT MyTable (Id, Plus, Other, Columns)
select CalcForNewId, Plus, Other, Columns
from [what may be a pretty convoluted query to determine the "next" valid Id]
This only works if you can write your business logic as a single reasonable query (where "reasonable" means it doesn't lock, block, or deadlock either the current or any other user for an unreasonable length of time). That can be a pretty tall order, but I'd take that over having to write complex BEGIN TRANSACTION/COMMIT/ROLLBACK code intermixed with TRY...CATCH blocks any day. (That will of course work, and I've upvoted @Scott Bruns accordingly.)
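To make the pseudo-code concrete, here is one possible shape for such a single-statement insert. It is a sketch under assumptions: the Series and Products tables, their columns, and the "smallest free number" rule are hypothetical, and SQLite via Python's sqlite3 stands in for SQL Server 2008. Statement atomicity by itself may not stop two concurrent sessions from computing the same number, so a unique constraint is kept as a backstop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Series (series_id INTEGER PRIMARY KEY,
                         min_nr INTEGER, max_nr INTEGER);
    CREATE TABLE Products (
        series_id INTEGER, product_nr INTEGER, name TEXT,
        UNIQUE (series_id, product_nr)
    );
    INSERT INTO Series VALUES (1, 100, 103);
    INSERT INTO Products VALUES (1, 100, 'a'), (1, 101, 'b'), (1, 103, 'd');
""")

# Candidate numbers are the series minimum and every used number + 1;
# the smallest candidate that is in range and unused gets inserted.
INSERT_NEXT_FREE = """
    INSERT INTO Products (series_id, product_nr, name)
    SELECT :sid, free.nr, :name
    FROM (
        SELECT MIN(c.nr) AS nr
        FROM (
            SELECT min_nr AS nr FROM Series WHERE series_id = :sid
            UNION
            SELECT product_nr + 1 FROM Products WHERE series_id = :sid
        ) AS c
        JOIN Series AS s ON s.series_id = :sid
        WHERE c.nr BETWEEN s.min_nr AND s.max_nr
          AND NOT EXISTS (SELECT 1 FROM Products AS p
                          WHERE p.series_id = :sid AND p.product_nr = c.nr)
    ) AS free
    WHERE free.nr IS NOT NULL
"""

def insert_product(conn, series_id, name):
    """Return True if a free number was found and a row was inserted."""
    before = conn.total_changes
    conn.execute(INSERT_NEXT_FREE, {"sid": series_id, "name": name})
    return conn.total_changes > before

first = insert_product(conn, 1, "c")    # fills the only gap, 102
second = insert_product(conn, 1, "e")   # the series is full by now
numbers = [r[0] for r in conn.execute(
    "SELECT product_nr FROM Products WHERE series_id = 1 ORDER BY product_nr")]
print(first, second, numbers)
```

When the series has no free number left, the statement inserts nothing rather than failing, so the caller can tell the two cases apart from the change count.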
Just insert the next number into table B. If it commits, it is yours to use. If you get a key violation, it means that a different process has just taken that number; in that case, increment the number and try again. (This relies on a unique key covering the product number, for example in combination with the second column you mention.)
The database will automatically handle the concurrency for you. No manual locking is required.
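The insert-and-retry idea might look like the following sketch. It assumes a unique constraint on (series_id, product_nr), which are hypothetical column names, and uses SQLite via Python's sqlite3 in place of SQL Server; in T-SQL the same loop would catch the duplicate-key error inside TRY...CATCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (series_id INTEGER, product_nr INTEGER, "
    "name TEXT, UNIQUE (series_id, product_nr))"
)
conn.execute("INSERT INTO Products VALUES (1, 100, 'taken')")

def insert_with_retry(conn, series_id, name, min_nr, max_nr):
    """Try each number in [min_nr, max_nr] until an INSERT succeeds."""
    for nr in range(min_nr, max_nr + 1):
        try:
            conn.execute(
                "INSERT INTO Products (series_id, product_nr, name) "
                "VALUES (?, ?, ?)", (series_id, nr, name))
            return nr
        except sqlite3.IntegrityError:
            # The unique constraint rejected this number because it is
            # already in use; move on to the next candidate.
            continue
    return None  # every number in the range is taken

got = insert_with_retry(conn, 1, "new", 100, 102)
print(got)
```

The unique constraint does the arbitration: whichever session inserts a number first wins, and the other simply moves on.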
I have a tree encoded in a MySQL database as edges:
CREATE TABLE items (
    num INT,
    tot INT,
    PRIMARY KEY (num)
);
CREATE TABLE tree (
    orig INT,
    term INT,
    FOREIGN KEY (orig) REFERENCES items (num),
    FOREIGN KEY (term) REFERENCES items (num)
);
For each leaf in the tree, items.tot is set by someone. For interior nodes, items.tot needs to be the sum of its children's tot values. Running the following query repeatedly would generate the desired result.
UPDATE items SET tot = (
SELECT SUM(b.tot) FROM
tree JOIN items AS b
ON tree.term = b.num
WHERE tree.orig=items.num)
WHERE EXISTS
(SELECT * FROM tree WHERE orig=items.num)
(note this actually doesn't work but that's beside the point)
Assume that the database exists and the invariant is already satisfied.
The question is:
What is the most practical way to update the DB while maintaining this requirement? Updates may move nodes around or alter the value of tot on leaf nodes. It can be assumed that leaf nodes will stay as leaf nodes, interior nodes will stay as interior nodes and the whole thing will remain as a proper tree.
Some thoughts I have had:
Full Invalidation, after any update, recompute everything (Um... No)
Set a trigger on the items table to update the parent of any row that is updated
This would be recursive (updates trigger updates, trigger updates, ...)
Doesn't work, MySQL can't update the table that kicked off the trigger
Set a trigger to schedule an update of the parent of any row that is updated
This would be iterative (get an item from the schedule, processing it schedules more items)
What kicks this off? Trust client code to get it right?
An advantage is that if the updates are ordered correctly, fewer sums need to be computed. But that ordering is a complication in and of its own.
An ideal solution would generalize to other "aggregating invariants"
FWIW I know this is "a bit overboard", but I'm doing this for fun (Fun: verb, Finding the impossible by doing it. :-)
The problem you are having is clear: recursion in SQL. You need to get the parent of the parent... of the leaf and update its total (either subtracting the old value and adding the new, or recomputing). You need some form of identifier that exposes the structure of the tree, so you can grab all of a node's children and the list of parents (the path) from a leaf that need updating.
This method adds constant space (2 columns to your table, but you only need one table, else you can do a join later). I played around with a structure a while ago that used a hierarchical format with 'left' and 'right' columns (obviously not those names), calculated by a pre-order traversal and a post-order traversal, respectively. Don't worry, these don't need to be recalculated every time.
I'll let you take a look at a page using this method in mysql instead of continuing this discussion in case you don't like this method as an answer. But if you like it, post/edit and I'll take some time and clarify.
I am not sure I understand your question correctly, but this could work: My take on trees in SQL.
The linked post describes a method of storing a tree in a database (PostgreSQL in that case), but the method is clear enough that it can easily be adapted to any database.
With this method you can easily update all the nodes that depend on a modified node K with about N simple SELECT queries, where N is the distance of K from the root node.
I hope your tree is not really deep :).
Good Luck!
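For the edge-list schema in the question, walking from a changed leaf up to the root can be sketched as below. This is not the linked post's method, only an illustration of the "one simple query per level" cost. It assumes that tree.orig is the parent and tree.term the child (as the question's UPDATE implies) and runs on SQLite via Python's sqlite3 instead of MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (num INTEGER PRIMARY KEY, tot INTEGER);
    CREATE TABLE tree (orig INTEGER REFERENCES items (num),
                       term INTEGER REFERENCES items (num));
    -- 1 is the root with children 2 and 3; 2 has leaf children 4 and 5.
    INSERT INTO items VALUES (1, 60), (2, 30), (3, 30), (4, 10), (5, 20);
    INSERT INTO tree VALUES (1, 2), (1, 3), (2, 4), (2, 5);
""")

def set_leaf_total(conn, leaf, new_tot):
    """Set a leaf's tot and add the difference to every ancestor."""
    (old_tot,) = conn.execute(
        "SELECT tot FROM items WHERE num = ?", (leaf,)).fetchone()
    delta = new_tot - old_tot
    node = leaf
    while node is not None:
        conn.execute(
            "UPDATE items SET tot = tot + ? WHERE num = ?", (delta, node))
        # One SELECT per level: look up the parent edge, if there is one.
        row = conn.execute(
            "SELECT orig FROM tree WHERE term = ?", (node,)).fetchone()
        node = row[0] if row else None

set_leaf_total(conn, 4, 25)
totals = dict(conn.execute("SELECT num, tot FROM items"))
print(totals)
```

Applying the delta instead of recomputing sums keeps each step to a single-row UPDATE, and because the walk is driven by client code, it sidesteps the trigger restriction mentioned in the question.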