I am designing a MySQL database for a new project. I will be importing 50-60 MB of data on a daily basis.
There will be a main table with a primary key. Then there will be child tables with their own primary key and a foreign key pointing back to the main table.
New data has to be parsed from a giant text file, and some minor manipulations are then made prior to importing into the master database. The parsing and import operation may involve a significant amount of troubleshooting, so I want to import new data into a temporary database and ensure its integrity before adding it to the master.
For this reason, my initial thought was to parse and import new data into a separate, temporary database each day. That way I could inspect the data before adding it to the master, and at the same time each day's data would be stored as a separate database should I ever need to rebuild the master later on from the individual temporary databases.
I am considering using primary and foreign keys with the InnoDB engine in order to maintain relational integrity across tables. This means I have to worry about the auto-increment ids (primary keys) not colliding when I import the new data each day.
So, given this situation, what would be best?
Make a copy of the master and import directly into the copy of the master each day, then replace the existing master with the new copy.
Import new data into a temporary database each day but change auto-increment start value of the primary keys to be greater than the maximum in the master. Would I then also change the auto-increment values for the primary keys for all tables (main table and its children)?
Import new data into a temporary database each day, not worrying about the primary key values, and find some other way to merge the temporary database with the master without primary key collisions. If using this strategy, how can I update the primary keys in the main table for the new data while making sure all the relationships with the child tables remain correct?
I'm not sure this is as complicated as you are making it?
Why not just do this:
Import the raw data into a temporary table (why does it have to be a separate database?).
Run your transformations/integrity checks on the temporary table.
When the data is good, insert it directly into the master table.
Use auto-incrementing ids on the master table that are not dependent on the data being imported. That lets you have a unique id as well as the original ids that might have existed in your import.
Add a field to your master table(s) that gives you a record of which import the records came from.
In addition to copying the data into your master table, keep a log that ties back to the data you merged. It helps you back out the data if you find it's wrong/bad and gives you an audit trail.
In the end just set up a sandbox database, write a bunch of stored procedures and test the crap out of it. =)
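A minimal sketch of that flow, assuming hypothetical names (daily_import as the staging table, master_main as the master table with its own AUTO_INCREMENT id):

-- Staging table for one day's parsed file; source_id is the id that came with
-- the raw data, import_batch records which daily load the row belongs to.
CREATE TABLE daily_import (
    source_id    INT,
    payload      VARCHAR(255),
    import_batch INT
);

-- Once the checks pass, copy into the master. master_main.id is an
-- AUTO_INCREMENT primary key, so it is assigned independently of the import.
INSERT INTO master_main (source_id, payload, import_batch)
SELECT source_id, payload, import_batch
FROM daily_import;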
We have the following requirement:
Currently, we get the data from the source (another server, another team, another DB) into a temp DB via batch jobs. Once the data is in our temp DB, we process and transform it and update our primary DB with the difference (i.e. the records that changed or were newly added).
Source -> tempDB (recreated daily) -> delta -> primaryDB
Requirement:
- To delete the data in the primary DB once it's deleted in the source.
For example: a record with ID=1 is created in the source, comes into the temp DB, and eventually makes it into the primary DB. When this record is deleted in the source, it should get deleted in the primary DB as well.
Challenge:
How do we delete from the primary DB when there is nothing to refer to in the temp DB (since the record is already deleted in the source, nothing arrives in the temp DB)?
Naive approach:
- We could clean out the primary DB before every transform and load it afresh. However, it takes a significant amount of time to clean up and repopulate the primary DB every time.
You could create triggers on each table that fill a history table with deleted entries. Sync that over to your temp DB and use it to delete the corresponding rows in your primary DB.
You either want one delete-history table per table, or a combined history table that also includes the name of the table that triggered the deletion.
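A rough sketch of such a delete trigger on the source side, assuming SQL Server (which the rest of this thread suggests) and a hypothetical dbo.Customer table:

-- History table holding the keys of deleted rows, plus when they were deleted.
CREATE TABLE dbo.Customer_DeleteHistory (
    Id        INT      NOT NULL,
    DeletedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

-- Fires once per DELETE statement; the "deleted" pseudo-table holds all removed rows.
CREATE TRIGGER trg_Customer_Delete ON dbo.Customer
AFTER DELETE
AS
BEGIN
    INSERT INTO dbo.Customer_DeleteHistory (Id)
    SELECT Id FROM deleted;
END;

The batch job can then pull the history table over with the rest of the data and issue the matching deletes against the primary DB.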
You might want to look into SQL Compare or other tools for syncing tables.
If you have access to tempDB and primeDB at the same time (same server or linked servers), you could also try a
delete p
from primeDB.dbo.Tablename as p
where not exists (
    select 1
    from tempDB.dbo.Tablename as t
    where t.Id = p.Id
)
which will perform awfully - ask your db designers.
In this scenario, if the temp DB and primary DB have no direct reference to each other, you can use event notifications at the database level to track changes.
Here is a link I found on the same topic:
https://www.mssqltips.com/sqlservertip/2121/event-notifications-in-sql-server-for-tracking-changes/
I'm developing an Android application in which the data is stored in a SQLite database.
I have set up a sync with a MySQL database on the web, to which I'm sending the data stored in SQLite on the device.
The problem is that I don't know how to maintain the relations between tables, because the primary keys are going to be updated with AUTO_INCREMENT, and the foreign keys remain the same, breaking the relations between tables.
If this is a full migration, don't use auto increment during migration - create tables with normal columns. Use ALTER TABLE to change the model after import.
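For example (hypothetical parent/child tables; just a sketch of the post-import ALTER TABLE step):

-- Data was loaded into plain columns; now restore keys and auto-increment.
ALTER TABLE parent ADD PRIMARY KEY (id);
ALTER TABLE parent MODIFY id INT NOT NULL AUTO_INCREMENT;
ALTER TABLE child ADD PRIMARY KEY (id);
ALTER TABLE child MODIFY id INT NOT NULL AUTO_INCREMENT;
ALTER TABLE child ADD FOREIGN KEY (parent_id) REFERENCES parent (id);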
For an incremental sync, the easiest way I see is an additional column in each MySQL table, called sqlite_id and filled with the original id. Then you can update the references using UPDATE with joins.
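Something along these lines (hypothetical parent/child tables, each carrying such a sqlite_id column):

-- child.parent_id still holds the old SQLite id of its parent; rewrite it to
-- the new MySQL AUTO_INCREMENT id by joining on the preserved sqlite_id.
UPDATE child AS c
JOIN parent AS p ON p.sqlite_id = c.parent_id
SET c.parent_id = p.id;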
Alternatives involve temporary tables for storing data and an auxiliary table used for pairing old and new ids. Tedious for a bigger data model.
The approach I tend to use, if possible, is to avoid auto-increment in such situations. I usually have an auxiliary table with four columns like this: t_import(tablename, operationid, sqlite_id, mysqlid).
Process is the following:
Import the primary keys into t_import. Use operationid to separate parallel imports if needed.
Generate new keys for the data tables and store them in the t_import table. This can be combined with step one.
Import the actual data, using t_import to set the new primary keys and restore the relations.
That should work for most scenarios I know about.
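A rough sketch of that auxiliary table and of how a relation gets restored from it (table and column names are purely illustrative):

-- Pairing table: for each source table and import run, map the old SQLite id
-- to the newly generated MySQL id.
CREATE TABLE t_import (
    tablename   VARCHAR(64) NOT NULL,
    operationid INT         NOT NULL,
    sqlite_id   INT         NOT NULL,
    mysqlid     INT         NOT NULL,
    PRIMARY KEY (tablename, operationid, sqlite_id)
);

-- Last step: point each child row at its parent's new MySQL id.
UPDATE child AS c
JOIN t_import AS m
  ON m.tablename = 'parent'
 AND m.operationid = 1
 AND m.sqlite_id = c.parent_id
SET c.parent_id = m.mysqlid;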
Thanks for the help, you have given me some ideas.
I will try to add an id2 field to the tables that will store the same value as the primary key (_id).
When I send the information from SQLite to MySQL and the primary key is regenerated, I will still have the id2 field with the original value of the primary key, so I can compare it with the foreign keys of the other tables and update them.
Let’s see if it works.
Thanks
I'm aware that copying data will only copy the data and no constraints; I'll lose my PKs and FKs, etc.
But what if I were to copy the data back into a table with primary keys and foreign keys? I want to remodel my DB in Workbench but don't want to lose the data I have put into my tables, so I was thinking of making a copy, deleting the original, remodelling the DB, forward engineering it, and then copying the data back into the tables. Will this work?
Not sure if I got your question right, but from what it looks like you only need a new table with the data to mess with. Have you tried using CREATE TABLE ... AS SELECT?
http://dev.mysql.com/doc/refman/5.0/en/create-table-select.html
CREATE TABLE t AS SELECT * FROM old_t
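And for copying the saved rows back into the remodelled tables (which keep their PK/FK definitions), a plain INSERT ... SELECT works; a sketch with made-up names:

-- 1. Save the data before dropping/remodelling.
CREATE TABLE orders_backup AS SELECT * FROM orders;
-- 2. Remodel and forward-engineer the new schema in Workbench.
-- 3. Copy the rows back; load parent tables before child tables so the
--    foreign keys are satisfied.
INSERT INTO orders (id, customer_id, total)
SELECT id, customer_id, total FROM orders_backup;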
I finally reached the data migration part of my project and am now trying to move data from MySQL to SQL Server.
SQL Server has a new schema (the mapping is not always one to one).
I am trying to use SSIS for the conversion, which I started learning this morning.
We have customer and customer location tables in MySQL and equivalent tables in SQL Server. In SQL Server all my tables now have a surrogate key column (GUID), and I am creating it in a Script Component.
Also note that I do have primary keys in the current MySQL tables.
What I am looking for is how I can add child records to the customer location table with the newly created GUID as the parent key.
I see that SSIS has a Foreach Loop container; is it of any use here?
If not, another possibility I can think of is to create two Data Flow Tasks and [somehow], just before the master data is sent to the destination component [table] in the primary data flow task, add a variable with the newly created GUID and another with the old primary ID, which would then be used to build the source for the data flow task for the child records.
Maybe, to simplify, this could also be done once the data flow task for the master is complete: the data flow task for the children then reads this master data and inserts the child records from MySQL into the SQL Server table. That would, though, mean that I have to load all my parent table records back into memory.
I know this is all very confusing, and it is mainly because I am very confused :-(, so bear with me, and if you want more information let me know.
I have been through many links that I found through a Google search, but none of them really explains (or I was not able to understand) how the process is carried out.
Please advise
regards,
Mar
Edit 1
After further searching and refining keywords, I found this link on SO and am going through it to see if it can be used in my scenario:
How to load parent child data found in EDI 823 lockbox file using SSIS?
OK, here is what I would do. Put the MySQL data into staging tables in SQL Server that have identity columns set up and an extra column for the eventual GUID, which will start out as null. Now your records have a primary key.
Next comes the sneaky trick. Pick a required field (we use last_name) and, instead of the real data, insert the value from the id field of the staging table. Now you have a record that has both the GUID and the id in it. Update the GUID field in the staging table by joining to it on the ID and the required field you picked out. Then update the last_name field with the real data.
To avoid the sneaky trick, and if this is only a one-time upload, add a column to your tables that contains the staging table id. Again, you can use this to get the GUID for inserting into related tables. Then, when you are done, drop the extra column.
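A hedged sketch of that extra-column variant, with made-up staging and destination table names (NEWID() used inline for brevity); the StagingID column is the temporary tie between the two loads and gets dropped at the end:

-- Load parents: generate the GUID here and keep the staging id alongside it.
INSERT INTO dbo.Customer (CustomerGUID, Name, StagingID)
SELECT NEWID(), s.Name, s.ID
FROM stg_Customer AS s;

-- Load children: look up the parent's GUID through the retained staging id.
INSERT INTO dbo.CustomerLocation (LocationGUID, CustomerGUID, Address)
SELECT NEWID(), c.CustomerGUID, l.Address
FROM stg_CustomerLocation AS l
JOIN dbo.Customer AS c ON c.StagingID = l.CustomerID;

-- Once everything is loaded, drop the helper column.
ALTER TABLE dbo.Customer DROP COLUMN StagingID;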
You are aware that there are performance issues involved with using GUIDs? Make sure not to make them the clustered index (which they will be by default as the PK unless you specify differently) and use newsequentialid() to populate them. Why are you using GUIDs? If an identity would work, it is usually better to use it.
Is there a way to keep a timestamped record of every change to every column of every row in a MySQL table? This way I would never lose any data and keep a history of the transitions. Row deletion could be just setting a "deleted" column to true, but would be recoverable.
I was looking at HyperTable, an open-source implementation of Google's BigTable, and this feature really whetted my appetite. It would be great to have it in MySQL, because my apps don't handle the huge amount of data that would justify deploying HyperTable. More details about how this works can be seen here.
Is there any configuration, plugin, fork or whatever that would add just this one functionality to MySQL?
I've implemented this in the past in a php model similar to what chaos described.
If you're using MySQL 5, you could also accomplish this with triggers that hook into the UPDATE and DELETE events of your table (optionally calling a stored procedure).
http://dev.mysql.com/doc/refman/5.0/en/stored-routines.html
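For instance, a pair of triggers like this (a minimal sketch for a hypothetical customer(id, name, email) table) copies the previous row state into a history table on every update and delete:

-- History table: one row per prior version of a customer row.
CREATE TABLE customer_history (
    id          INT NOT NULL,
    name        VARCHAR(100),
    email       VARCHAR(100),
    change_type VARCHAR(10) NOT NULL,
    changed_at  DATETIME NOT NULL
);

CREATE TRIGGER customer_bu BEFORE UPDATE ON customer
FOR EACH ROW
    INSERT INTO customer_history (id, name, email, change_type, changed_at)
    VALUES (OLD.id, OLD.name, OLD.email, 'update', NOW());

CREATE TRIGGER customer_bd BEFORE DELETE ON customer
FOR EACH ROW
    INSERT INTO customer_history (id, name, email, change_type, changed_at)
    VALUES (OLD.id, OLD.name, OLD.email, 'delete', NOW());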
I do this in a custom framework. Each table definition also generates a Log table related many-to-one with the main table, and when the framework does any update to a row in the main table, it inserts the current state of the row into the Log table. So I have a full audit trail on the state of the table. (I have time records because all my tables have LoggedAt columns.)
No plugin, I'm afraid, more a method of doing things that needs to be baked into your whole database interaction methodology.
Create a table that stores the following info...
CREATE TABLE MyData (
ID INT IDENTITY,
DataID INT )
CREATE TABLE Data (
ID INT IDENTITY,
MyID INT,
Name VARCHAR(50),
Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)
Now create a sproc that does this...
-- Add the new row version, then point the key record at it.
INSERT Data (MyID, Name)
VALUES (@MyID, @Name)

UPDATE MyData SET DataID = @@IDENTITY
WHERE ID = @MyID
In general, the MyData table is just a key table. You then point it to the record in the Data table that is the most current. Whenever you need to change data, you simply call the sproc, which inserts the new data into the Data table and then updates MyData to point to the most recent record. All of the other tables in the system would key themselves off of MyData.ID for foreign key purposes.
This arrangement sidesteps the need for a second log table (and keeping the two in sync when the schema changes), but at the cost of an extra join and some overhead when creating new records.
Do you need it to remain queryable, or will this just be for recovering from bad edits? If the latter, you could just set up a cron job to back up the actual files where MySQL stores the data and send it to a version control server.