How to merge 2 records in InnoDB MySQL databases

This is related to How to change ID in mysql
I also have checked other questions and none are quite like this one.
As we know, InnoDB has a feature: if I change the ID of a record, for example, then all other tables that point to the previous ID will magically be updated (via foreign keys declared with ON UPDATE CASCADE).
What about if I want to MERGE 2 records?
Say I have 2 businesses.
They have 2 IDs.
I want to merge them into one. I also want to use InnoDB's awesome feature to automatically change things.
I can't just change one of the IDs to the other ID. Or can I?
What would you do to merge 2 similar records in a database?
Of course what actually goes into the combined record will be business decisions.
Basically I just do not want to pinpoint all the other tables one by one. I think the ON UPDATE rule is there for a reason. Is there a way where I just change slaveID to masterID, keep ALL the data in master the same, and then have the database itself (rather than my program) repoint all tables that point to slaveID to point to masterID? Of course, the record for slaveID will be gone anyway.
For example, with a normal MySQL engine, you can change an ID, but then you have to go through every table that points to the old ID and repoint it to the new ID yourself. With InnoDB, that repointing is done by the database engine itself, which is kind of cool. Why would anyone use a non-InnoDB engine anyway?
I want to do the same but for merging.

Trying to set a record's primary key to an already existing value will simply result in a key violation error. While this seems simple at first glance, it has a side effect: you cannot use ON UPDATE CASCADE to merge two records - it will simply not work.
If you have the possibility to change the schema, you can use the old but good redirect-trick:
(Assuming your IDs are positive, maybe unsigned ints)
Add a field redirect int not null default 0
Create a view:
CREATE VIEW tablename_view AS
SELECT
  -- repeat the next line for every field apart from redirect
  IF(s.redirect > 0, m.<fieldname>, s.<fieldname>) AS <fieldname>
FROM tablename AS s
LEFT JOIN tablename AS m ON s.redirect = m.id;
When you merge a record (slave) into another record (master), run UPDATE tablename SET redirect=<id_of_master> WHERE id=<id_of_slave>
Adapt your select queries to select from tablename_view instead of tablename
Create and use a maintenance script to weed out merger slaves
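A sketch of what such a maintenance script could do, assuming a single referencing table child_table with a foreign key column tablename_id (both names are made up; repeat the UPDATE for every table that references tablename.id):

-- repoint every row that still references a merged-away slave to its master
UPDATE child_table AS c
JOIN tablename AS t ON c.tablename_id = t.id
SET c.tablename_id = t.redirect
WHERE t.redirect > 0;

-- once nothing references the slaves any more, remove them
DELETE FROM tablename WHERE redirect > 0;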

Related

Database Architecture for logging

This is something that has bothered me for a long time and I still have been unable to find an answer.
I have a huge system with a lot of different features. What is common for this system is of course that my users can
create, update, read & delete
different parts of my system.
For simplicity's sake, let's say I have an application that has the following features:
Document administration
Video administration
User administration
Salary administration
(Please note I took these at random, just to make the point that all of these would have their own separate tables and are not necessarily connected.)
Now I wish to create some sort of logging system, so that whenever someone creates, updates or deletes an entity it will be recorded.
Now as far as I can see, I can do this in two ways.
1.
Create a logging table for each of the 4 features in my system. However, with this method I am required to create a logging table for each new feature I add to the system. I would also have to combine data from X number of tables if I wish to create a log, which could potentially be a huge task!
2.
I could create something like the following:
However, once again I would have to add a column for each new feature I add.
So my question is: what is the best way to design a logging database architecture?
Or is there an easier way?
Instead of one target_xx for each feature, you could do it this way:
target_id | target_type
----------+------------
        1 | video
        4 | document
        5 | user
        2 | user
Or even better: a table with target types, and insert only the respective IDs into target_type.
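Something like this (a sketch only; the table and column names are made up):

CREATE TABLE target_types (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)  NOT NULL  -- 'document', 'video', 'user', 'salary', ...
);

CREATE TABLE activity_log (
  id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id        INT UNSIGNED NOT NULL,             -- who did it
  action         ENUM('create','update','delete') NOT NULL,
  target_type_id INT UNSIGNED NOT NULL,             -- what kind of entity
  target_id      INT UNSIGNED NOT NULL,             -- which row in that feature's table
  created_at     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (target_type_id) REFERENCES target_types (id)
);

Adding a new feature then only means inserting one row into target_types instead of adding a column.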
If you want to capture creation and update dates for each table, I would just use the default and the on-update event from MySQL. You can define the fields like this for a table:
ALTER TABLE `mytable`
ADD COLUMN CreateDate DATETIME DEFAULT CURRENT_TIMESTAMP,
ADD COLUMN LastModifiedDate DATETIME ON UPDATE CURRENT_TIMESTAMP;
(DATETIME columns accept CURRENT_TIMESTAMP defaults and ON UPDATE from MySQL 5.6.5 on; older versions need TIMESTAMP.)
You can add these 2 fields to all tables. If you want to use one central table for logging (which might be more difficult to manage, because you always need to create joins, and performance may be worse), then I would work with triggers.
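For example, a trigger along these lines could feed the central log table sketched above (the documents table and its last_modified_by column are assumptions; a trigger cannot see the application user by itself, so the table has to carry it):

DELIMITER //
CREATE TRIGGER documents_after_update
AFTER UPDATE ON documents
FOR EACH ROW
BEGIN
  INSERT INTO activity_log (user_id, action, target_type_id, target_id)
  VALUES (NEW.last_modified_by, 'update',
          (SELECT id FROM target_types WHERE name = 'document'),
          NEW.id);
END//
DELIMITER ;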

MySQL on duplicate key delete

I am looking for a (not too convoluted) solution to a MySQL problem. Say I have the following table (with a composite unique index on group and item):
group   | item
--------+--------
nogroup | item_a
group_a | item_a
Then, eventually, item_a no longer belongs to group_a. So I want to do something like:
update table set group = "nogroup" where item = "item_a" on duplicate key delete.
(obviously this is not valid syntax, but I am looking for a way around this)
I still want to keep a copy of the record with nogroup because, if item_a comes back later on, I can change its group back to group_a or any other group depending on the case. Whenever item_a is added, there is an insert that copies all the data from the nogroup record and sets a proper group label. At that point there are two records for item_a: one with group_a and one with nogroup. The reason it is done this way is to reuse previous data as much as possible, as a new entry (with no previous record) is much more involved and takes significantly more time and processing.
Say an item belongs to group_a and group_b but suddenly it does not belong to any group: the first update to set the group to "nogroup" will work, but the second update will raise a duplicate key error.
The option of not updating the group column at all and using "insert on duplicate key update" does not work, because there won't be duplicates when the groups are different, and this will lead to cases where an item no longer belongs to a group and yet its record is still present in the database. The option of verifying whether "nogroup" exists first and then updating it to a specific group does not work either, because if item_a belongs to more than one group, this would update all the other records to the same group.
Basically, an item can either 1) belong to any number of groups, including "nogroup", or 2) belong solely to "nogroup", and there should always be at least a "nogroup" copy somewhere in the database.
It looks like I won't be able to do this in just one query, but if someone has a clean way of dealing with this, that would be much appreciated. Maybe some of my assumptions above are wrong and there is an easy way to do it.
Your whole process of maintaining this items-to-groups mapping sounds too complicated. Why not just have a table that has a mapping? Then, when an item is removed from a group, delete it from the table. When it is added, add it to the table. Don't bother with "nogroup".
If you want an archive table, then create one. Have an insert/update/delete trigger (whichever is or are appropriate) that will populate an archive with information that you want to keep over time.
I do not understand why re-using an existing row would be beneficial in terms of performance. There is no obvious database reason why this would be the case.
I am also confused as to why you need a "nogroup" tag at all. If you need a list of items, maintain that list in its own table. And call the table Items -- a much clearer name than "nogroup".
I agree with Gordon's approach. However, if you have to do it with a single table, it cannot be done in one SQL query. You will have to use two queries: one for the update and one for the delete.
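A sketch of those two queries, assuming the table is called item_group with a unique key on (group, item) (the names are assumptions; note that group is a reserved word and needs backticks):

-- 1) move the row to "nogroup"; with IGNORE, a row that would
--    create a duplicate key is simply left untouched
UPDATE IGNORE item_group
SET `group` = 'nogroup'
WHERE item = 'item_a' AND `group` = 'group_a';

-- 2) if the row could not be moved because a "nogroup" copy
--    already exists, it still has its old group, so delete it
DELETE FROM item_group
WHERE item = 'item_a' AND `group` = 'group_a';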

adding data to interrelated tables..easier way?

I am a bit rusty with MySQL and trying to jump back in, so sorry if this is too easy a question.
I basically created a data model that has a table called "Master" with required fields of a name and an IDcode, and then a "Details" table with a foreign key of IDcode.
Now here's where it's getting tricky. I am entering:
INSERT INTO Details (Name, UpdateDate) Values (name, updateDate)
I get an error saying IDcode on Details doesn't have a default value, so I add one; then it complains that field 'Master_IDcode' doesn't have a default value.
It all makes sense, but I'm wondering if there's an easier way to do what I am trying to do. I want to add data into Details and, if no IDcode exists, I want to add an entry to the Master table. The problem is that I have to first add the name to Master, wait for a unique ID to be generated (for IDcode), then figure that out and add it to my query when I enter the Details data. As you can imagine, the queries are probably going to get quite long, since I have many tables.
Is there an easier way, where every time I add something it searches by name to see whether a foreign key exists, and if not, adds the entry to all the tables it is linked to? Is there a standard way people do this? I can't imagine that, with all the complex databases out there, people have not figured out an easier way.
Sorry if this question doesn't make sense. I can add more information if needed.
P.S. This may be a different question, but I have heard of Django for Python and that it helps create queries. Would it help my situation?
Thanks so much in advance :-)
(decided to expand on the comments above and put it into an answer)
I suggest creating a set of staging tables in your database (one for each data set/file).
Then use LOAD DATA INFILE (or insert the rows in batches) into those staging tables.
Make sure you drop indexes before the load, and re-create what you need after the data is loaded.
You can then make a single pass over the staging table to create the missing master records. For example, let's say that one of your staging tables contains a country code that should be used as a masterID. You could add the missing master records by doing something along the lines of:
insert
into master_table(country_code)
select distinct s.country_code
from staging_table s
left join master_table m on(s.country_code = m.country_code)
where m.country_code is null;
Then you can proceed and insert the rows into the "real" tables, knowing that all detail rows reference a valid master record.
If you need to get reference information along with the data (such as translating some code) you can do this with a simple join. Also, if you want to filter rows by some other table this is now also very easy.
insert
into real_table_x(
     `key`
    ,colA
    ,colB
    ,colC
    ,computed_column_not_present_in_staging_table
    ,understandableCode
)
select x.`key`
      ,x.colA
      ,x.colB
      ,x.colC
      ,(x.colA + x.colB) / x.colC
      ,c.understandableCode
from staging_table_x x
join code_translation c on(x.strange_code = c.strange_code);
This approach is a very efficient one and it scales very nicely. Variations of the above are commonly used in the ETL part of data warehouses to load massive amounts of data.
One caveat with MySQL is that it doesn't support hash joins, a join mechanism well suited to fully joining two tables. MySQL uses nested loops instead, which means that you need to index the join columns very carefully.
InnoDB tables with their clustering feature on the primary key can help to make this a bit more efficient.
One last point. When you have the staging data inside the database, it is easy to add some analysis of the data and put aside "bad" rows in a separate table. You can then inspect the data using SQL instead of wading through csv files in your editor.
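For example, something along these lines could quarantine rows whose code has no translation (staging_errors is a made-up table with the same structure as staging_table_x):

INSERT INTO staging_errors
SELECT x.*
FROM staging_table_x x
LEFT JOIN code_translation c ON (x.strange_code = c.strange_code)
WHERE c.strange_code IS NULL;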
I don't think there's a one-step way to do this.
What I do is issue an
INSERT IGNORE INTO master (..) VALUES (..)
to the master table, which will either create the row if it doesn't exist, or do nothing, and then issue a
SELECT id FROM master WHERE someUniqueAttribute = ..
The other option would be stored procedures/triggers, but they are still pretty new in MySQL and I doubt whether this would help performance.
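Applied to the schema in the question, that flow might look like this (the column names and the UNIQUE index on Master.Name are assumptions):

-- create the master row only if the name is not already there
INSERT IGNORE INTO Master (Name) VALUES ('Some Fund');

-- fetch the key, whether the row was just created or already existed
SELECT IDcode FROM Master WHERE Name = 'Some Fund';

-- then insert the detail row using that key
INSERT INTO Details (Master_IDcode, Name, UpdateDate)
VALUES (42, 'Some Fund', NOW());  -- 42 = the IDcode fetched above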

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning, the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database, this would come for free, and I would be able to ask 'what is the state of all people as of yesterday at 2pm living in Dublin and aged 30'. Unfortunately there don't seem to be any mature open source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id = table key; person_id = id per person, duplicated if the person has many revisions.)
select name from person p where p.id = (select max(id) from person where person_id = p.person_id and timestamp <= :timestamp)
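One thing that may be worth trying before changing the model (the column names below are assumptions about your schema, so measure before and after): a composite index that lets the correlated subquery be resolved from the index alone.

ALTER TABLE person ADD INDEX idx_person_revision (person_id, `timestamp`, id);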
Update
It looks like the best way to do this is with a temporal db, but given that there aren't any open source ones out there, the next best method is to store a new row per update. The only problems are the duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
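A sketch of that first approach (all names are made up):

CREATE TABLE person_version (
  person_id INT NOT NULL,
  version   INT NOT NULL,
  name      VARCHAR(100) NOT NULL,
  created   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (person_id, version)
);

-- every modification is an insert of the next version
INSERT INTO person_version (person_id, version, name)
SELECT person_id, MAX(version) + 1, 'new name'
FROM person_version
WHERE person_id = 42
GROUP BY person_id;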
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
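A sketch of that two-step write (the keys and column names are made up):

-- 1) insert the new version, pointing back at the row it replaces
INSERT INTO person (id, previousVersion, name)
VALUES (1001, 1000, 'new name');

-- 2) follow the back pointer to close off the old version
UPDATE person SET replacedBy = 1001 WHERE id = 1000;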
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!

How to delete from a database?

I know of two ways to delete data from a database table
DELETE it forever
Use a flag like isActive/isDeleted
Now the problem with isActive is that I have to track everywhere in my SQL queries whether the record is active or not. Using DELETE, however, gets rid of the data forever.
What would be the best way to back up this data?
Assuming I have multiple tables in a database, should I have a common function which just backs everything up and stores it in another table (as XML, probably?), or is there another way?
I am using MySQL but am curious about techniques used in other DBs as well.
Replace the table with a view that hides the inactive items.
Or write a trigger on DELETE that backs up the row to an archive table.
You could use a trigger that fires on deleting records to back them up into some kind of graveyard table.
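A sketch of such a trigger, assuming a customer table and a customer_graveyard table with matching columns plus a deletion timestamp (all names are made up):

DELIMITER //
CREATE TRIGGER customer_before_delete
BEFORE DELETE ON customer
FOR EACH ROW
BEGIN
  INSERT INTO customer_graveyard (id, name, deleted_at)
  VALUES (OLD.id, OLD.name, NOW());
END//
DELIMITER ;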
You could use an isDeleted column and define a view which selects all columns except isDeleted, with the condition isDeleted = false. Then have everything else work only with the view.
You could maintain a history table, where you back the record up with a timestamp.
One of the biggest reasons for not deleting data is that it may be required for a relation - for example, the user may decide to delete an old customer from the database, but you still need the customer record because it is referenced by old invoices (which may have a much longer lifespan).
Based on this, the best solution is often the "IsDeleted" type of column, combined with a view (Quassnoi has mentioned partitioning, which can help with performance issues that might pop up due to a lot of invisible data).
You can partition your tables on the DELETED column and define the views which would include the condition:
… AND deleted = 0
This will make the queries over the active data just as simple and efficient.
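A sketch of that setup with a made-up invoice table. Note that MySQL requires the partitioning column to be part of every unique key, so deleted has to be added to the primary key first:

ALTER TABLE invoice DROP PRIMARY KEY, ADD PRIMARY KEY (id, deleted);

ALTER TABLE invoice
  PARTITION BY LIST (deleted) (
    PARTITION p_active  VALUES IN (0),
    PARTITION p_deleted VALUES IN (1)
  );

CREATE VIEW active_invoice AS
SELECT * FROM invoice WHERE deleted = 0;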
Well, if you were using SQL Server you could use triggers, which would allow you to move the record to a deleted table.