Adding a time dimension to MySQL cells - mysql

Is there a way to keep a timestamped record of every change to every column of every row in a MySQL table? This way I would never lose any data and keep a history of the transitions. Row deletion could be just setting a "deleted" column to true, but would be recoverable.
I was looking at HyperTable, an open source implementation of Google's BigTable, and this feature really wet my mouth. It would be great if could have it in MySQL, because my apps don't handle the huge amount of data that would justify deploying HyperTable. More details about how this works can be seen here.
Is there any configuration, plugin, fork or whatever that would add just this one functionality to MySQL?

I've implemented this in the past in a php model similar to what chaos described.
If you're using mysql 5, you could also accomplish this with a stored procedure that hooks into the on update and on delete events of your table.
http://dev.mysql.com/doc/refman/5.0/en/stored-routines.html

I do this in a custom framework. Each table definition also generates a Log table related many-to-one with the main table, and when the framework does any update to a row in the main table, it inserts the current state of the row into the Log table. So I have a full audit trail on the state of the table. (I have time records because all my tables have LoggedAt columns.)
No plugin, I'm afraid, more a method of doing things that needs to be baked into your whole database interaction methodology.

Create a table that stores the following info...
CREATE TABLE MyData (
ID INT IDENTITY,
DataID INT )
CREATE TABLE Data (
ID INT IDENTITY,
MyID INT,
Name VARCHAR(50),
Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)
Now create a sproc that does this...
INSERT Data (MyID, Name)
VALUES(#MyID,#Name)
UPDATE MyData SET DataID = ##IDENTITY
WHERE ID = #MyID
In general, the MyData table is just a key table. You then point it to the record in the Data table that is the most current. Whenever you need to change data, you simply call the sproc which Inserts the new data into the Data table, then updates the MyData to point to the most recent record. All if the other tables in the system would key themselves off of the MyData.ID for foreign key purposes.
This arrangement sidesteps the need for a second log table(and keeping them in sync when the schema changes), but at the cost of an extra join and some overhead when creating new records.

Do you need it to remain queryable, or will this just be for recovering from bad edits? If the latter, you could just set up a cron job to back up the actual files where MySQL stores the data and send it to a version control server.

Related

delete entry in DB without reference

We have the below requirement:
Currently, we get the data from source (another server, another team, another DB) into a temp DB (via batch jobs) and after we get data into our temp DB, we process the data, transform and update our primary DB with the difference (i.e. the records that changed or the newly added records).
Source->tempDB (daily recreated)->delta->primaryDB
Requirement:
- To delete the data in primary DB once its deleted in source.
Ex: suppose a record with ID=1 is created in source, it comes to temp DB and eventually makes it to primary DB. When this record is deleted in source, it should get deleted in primary DB also.
Challenge:
How do we delete from primary DB when there is nothing to refer to in temp DB (since the record is already deleted in source, nothing comes in tempDB).
Naive approach:
- We can clean up primary DB, before every transform and load afresh. However, it takes a significant amount of time to clean up and populate primary DB everytime.
You could create triggers on each table that fills a history table with deleted entries. Synch that over to your tempDB and use it to delete stuff i your primary DB.
You either want one "delete-history-table" per table or a combined history table that also includes the tablename which triggered the deletion.
You might want to look into SQL Compare or other tools for synching tables.
If you have access to tempDB and primeDB (same server or linked servers) at the same time you could also try a
delete *
from primeBD.Tablename
where not exists (
select 1
from tempDB.Tablename where id = primeDB.Tablename.Id
)
which will perform awfully - ask your db designers.
In this scenorio if TEMPDB & Primary DB have no direct reference then can use track event notification on database level .
Here is the link i got for same :
https://www.mssqltips.com/sqlservertip/2121/event-notifications-in-sql-server-for-tracking-changes/

How to solve a real time dwh delete process?

I am trying to create a near real time dwh. My first attempt is every 15 minutes load a table into my application from my DWH.
I would like to avoid all the possible problems that a near real time DWH can face. One of those problems is query an empty table that shows the value for a multiselect html tag.
To solve this I have thought the following solution but I do not know if there exists a standard to solve this kind of problem.
I create a table like this to save the possible values of the multiselect:
CREATE TABLE providers (
provider_id INT PRIMARY KEY,
provider_name VARCHAR(20) NOT NULL,
delete_flag INT NOT NULL
)
Before the insert I update the table like this:
UPDATE providers set my_flag=1
I insert rows with an ETL process like this:
INSERT INTO providers (provider_name, delete_flag) VALUES ('Provider1',0)
From my app I query the table like this:
SELECT DISTINCT provider_name FROM providers
While the app still working and selecting all providers without duplicated (The source can delete, add or update one provider, so I always have to still updated respect the source) and without showing an error because table is empty I can run this statement just after the insert statement:
DELETE FROM providers WHERE delete_flag=1
I think that this is a good solution for small tables, or big tables with few changes, but what happens when a table is big? Exist some standard to solve this kind of problems?
We can not risk user usability because we are updating data.
There are two aproaches to publich a bulk change of a dimenstion without taking a maintainance window that would interupt the queries.
The first one is simple using a transactional concept, but performs bad for large data.
DELETE the replaced dimension records
INSERT the new or changed dimension records
COMMIT;
Note that you need no logical DELETE flag as the changes are visible only after the COMMIT - so the table is never empty.
As mentioned this approach is not suitable if you have a large dimension with lot of changes. In such case you may use the EXCHANGE PARTITION feature as of MySQL 5.6
You define a temporary table with he same structure as your dimension table, that is partitioned with only one partition containing all data.
CREATE TABLE dim_tmp (
id INT NOT NULL,
col1 VARCHAR(30),
col2 VARCHAR(30)
)
PARTITION BY RANGE (id) (
PARTITION pp VALUES LESS THAN (MAXVALUE)
);
Populate the table with the complete new dimension definition and switch this temporary table with your dimension table.
ALTER TABLE dim_tmp EXCHANGE PARTITION pp WITH TABLE dim;
After this statement the data from the temporary table will be stored (published) in your dimension table (new definition) and the old state of the dimension will be stored in the temporary table.
Please check the documentation link above for constraints of this feature.
Disclaimer: I use this feature in Oracle DB and I have no experience with it in MySQL.

MySQL: Best way to update a large table

I have a table with huge amount of data. The source of data is an external api. Every few hours, I need to sync the database so that the changes are up to date from the external api. I am doing a full sync (api doesn't allow delta sync).
While sync happens, I want to make sure that the data from the database is also available for read. So, I am following below steps:
I have a cloumn in the table which acts as a flag for whether or not data is readable. Only the data with flag set is marked for read.
I am inserting all the data from the api into the table.
Once all the data is written, I am deleting all the data in the table with flag set.
After deletion, I am updating the table and setting the flag for all the rows.
Table has around ~50 million rows and is expected to grow. There is a customerId field in the table. Sync usually happens based on customerId by passing it to the api.
My problem is, step 3 and 4 above are taking a lot of time. Queries are something like:
Step 3 --> delete from foo where customer_id=12345678 and flag=1
Step 4 --> update foo set flag=1 where customer_id=12345678
I have tried partitioning the table based on customer_id and it works great where customer_id has less number of rows but for some customer_id, the number of rows in each partition itself goes till ~5 million.
Around 90% of data doesn't change between two syncs. How can I make this fast?
I was thinking of using just the update queries instead of insert queries and then check if there was any update. If not, I can issue an insert query for the same row. This way any updates will be taken care of along with the insert. But I am not sure if the operation will block read queries for this while update is in progress.
For your setup (read only data, full sync), the fastest way to update the table is to not update at all, but to import the data into a different table and to rename it afterwards to make it the new table.
Create a table like your original table, e.g. use
create table foo_import like foo;
If you have e.g. triggers, add them too.
From now on, let the import api write its (full) sync to this new table.
After a sync is done, swap the two tables:
RENAME TABLE foo TO foo_tmp,
foo_import TO foo,
foo_tmp to foo_import;
It will (literally) just require a second.
This command is atomic: it will wait for transactions that access these tables to finish, it will not present a situation where there is no table foo and it will completely fail (and not do anything) if one of the tables doesn't exist or foo_tmp already exists.
As a final step, empty your import table (that now contains your old data) to be ready for your next import:
truncate foo_import;
This will again just require a second.
The rest of your querys probably assume that flag=1. Until (if at all) you update the code to not use the flag anymore, you can set its default value to 1 to keep it compatible, e.g. use
alter table foo modify column flag tinyint default 1;
Since you don't have foreign keys, it doesn't have to bother you, but for others with a similar problem it might be useful to know that foreign keys will get adjusted, so foreign keys that are referencing foo will reference foo_import after renaming the tables. To make them point to the new table foo again, they have to be dropped and recreated. Everything else (e.g. views, queries, procedures) will resolve by the current name, so they will always access the current foo.
CREATE TABLE new LIKE real;
Load `new` by whatever means you have; take as long as needed.
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
The RENAME is atomic and "instantaneous"; real is "always" available.
(I don't see the need for flag.)
OR...
Since you are actually updating a chunk of a table, consider these...
If the chunk is small...
Load the new data into a tmp table
DELETE the old rows
INSERT ... SELECT ... to move the new rows in. (Having the new data already in a table is probably the fastest way to achieve this.)
If the chunk is big, and you don't want to lock the table for "too long", there are some other tricks. But first, is there some form of unique row number for each row for the customer? (I'm thinking about batch-moving a bunch or rows at a time, but need more specifics before spelling it out.)

Importing new data to master database versus temporary database?

I am designing a MySQL database for a new project. I will be importing 50-60 MB of data on a daily basis.
There will be a main table with a primary key. Then there will be child tables with their own primary key and a foreign key pointing back to the main table.
New data has to be parsed from a giant text file and then some minor manipulations made prior to importing into the master database. The parsing and import operation may involve a significant amount of troubleshooting so I want to import new data into a temporary database and ensure its integrity before adding to the master.
For this reason, I thought initially to parse and import new data into a separate, temporary database each day. In this way, I would be able to inspect the data prior to adding to the master and at the same time I would have each day's data stored as a separate database should I ever need to rebuild the master later on from the individual temporary databases.
I am considering the use of primary keys / foreign keys with the InnoDB engine in order to maintain relational integrity across tables. This means I have to worry about auto-increment ids (primary key) not having any duplicates when I go to import the new data each day.
So, given this situation, what would be best?
Make a copy of the master and import directly into the copy of the master each day. Replace existing master with the new copy.
Import new data into a temporary database each day but change auto-increment start value of the primary keys to be greater than the maximum in the master. Would I then also change the auto-increment values for the primary keys for all tables (main table and its children)?
Import new data into a temporary database each day, not worrying about the primary key values. Find some other way to merge the temporary database with the master without collisions of the primary keys? If using this strategy, how can I update the primary key in the main table for the new data while making sure all the relationships with the child tables remain correct?
I'm not sure this is as complicated as you are making it?
Why not just do this:
Import raw data into temporary table (why does it have to be a separate database?)
Run your transformations/integrity checks on the temporary table.
When the data is good, insert it directly into the master table.
Use auto incrementing ids on the master table that are not dependent on your data being imported. That allows you to have a unique id and the original ids that might have existed in your import.
Add a field to your master table(s) that gives you a record of which import the records came from.
In addition to copying the data to your master table, make a log that ties back to the data you merged. Helps you back out the data if you find it's wrong/bad and gives you an audit trail.
In the end just set up a sandbox database, write a bunch of stored procedures and test the crap out of it. =)

ssis Data Migration - Master Detail records with new surrogate keys

Finally reached data migration part of my Project and now trying to move data from MySQL to SQL Server.
SQL Server has new schema (mapping is not always one to one).
I am trying to use SSIS for the conversion, which I started learning today morning.
We have customer and customer location table in MySQL and equivalent table in SQL Server. In SQL server all my tables now have surrogate key column (GUID) and I am creating the same in Script Component.
Also note that I do have a primary key in current mysql tables.
What I am looking for is how I can add child records to customer location table with newly created guid as parent key.
I see that SSIS have Foreach loop container, is this of any use here.
if not another possibility that I can think of is create two Data Flow Task and [somehow] just before the master data is sent to Destination Component [Table] on primary dataflow task , add a variable with newly created GUID and another with old PrimaryID, which will be used to create source for DataTask Flow for child records.
May be to simplyfy , this can also be done once datatask for master is complete and then datatask for child reads this master data and inserts child records from MySQL to SQL Server table. This would though mean that I have to load all my parent table records back into memory.
I know this is all too confusing and it is mainly because I am very confused :-(, to bear with me and if you want more information let me know.
I have been through may links that i found through google search but none of them really explains( or I was not able to uderstand) how the process is carried out.
Please advise
regards,
Mar
** Edit 1**
after further searching and refining key words i found this link in SO and going through it to see if it can be used in my scenario
How to load parent child data found in EDI 823 lockbox file using SSIS?
OK here is what I would do. Put the my sql data into staging tables in sql server that have identity columns set up and an extra column for the eventual GUID which will start out as null. Now your records have a primary key.
Next comes the sneaky trick. Pick a required field (we use last_name) and instead of the real data insert the value form the id field in the staging table. Now you havea record that has both the guid and the id in it. Update the guid field in the staging table by joing to it on the ID and the required field you picked out. Now update the last_name field with the real data.
To avoid the sneaky trick and if this is only a onetime upload, add a column to your tables that contains the staging table id. Again you can use this to get the guid for inserting to related tables. Then when you are done, drop the extra column.
You are aware that there are performance issues involved with using GUIDs? Make sure not to make them the clustered index (as the PK they will be by default unless you specify differntly) and use newsequentialid() to populate them. Why are you using GUIDs? If an identity would work, it is usually better to use it.