Can I do Change Data Capture with MariaDB's Automatic Data Versioning - mysql

We're using MariaDB in production and we've added a MariaDB slave so that our data team can run ETL tasks from this slave to our data warehouse. However, they lack a proper Change Data Capture feature (i.e. they want to know which rows of a production table have changed since yesterday, so that they only query rows that actually changed).
I saw that MariaDB 10.3 has an interesting feature that lets you perform a SELECT on an older version of a table. However, I haven't found any resources supporting the idea that it could be used for CDC. Any feedback on this feature?
If not, we'll probably resort to streaming the slave's binlogs to our data warehouse, but that looks challenging.
Thanks for your help!

(As a supplement to Stefan's answer)
Yes, system versioning can be used for CDC, because the validity period given by ROW_START (when the row version becomes valid) and ROW_END (when it stops being valid) lets you infer when an INSERT, UPDATE or DELETE query happened. But it is more cumbersome than alternative CDC approaches.
INSERT:
The object is found for the first time
ROW_START is the insertion time
UPDATE:
The object has been seen before
ROW_START is the update time
DELETE:
ROW_END lies in the past
There is no newer entry for this object in the following rows
I'll add a picture to clarify this.
You can see that this versioning saves space, because the information about an object's INSERT and DELETE can be combined in one row, but checking for DELETEs is costly.
In the example above I used a table with a clear primary key, so checking whether two rows describe the same object is easy: just look at the id. If you want to capture changes in tables with a composite key, this can make the whole process more tedious.
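The INSERT/UPDATE/DELETE rules above can be sketched in a few lines of Python. This is only an illustration, not an official MariaDB recipe: the function name `classify_changes`, the tuple layout, and the `OPEN_ENDED` threshold are assumptions made for the example. It assumes the rows come from a `FOR SYSTEM_TIME ALL` query that selects the id together with ROW_START and ROW_END, and that the current version of a row carries a far-future ROW_END.

```python
from datetime import datetime

# Assumption: the current version of a row has a far-future ROW_END, so any
# ROW_END before this threshold means that version has been closed.
OPEN_ENDED = datetime(2038, 1, 1)


def classify_changes(versions, since):
    """Classify the net change per object after `since`.

    versions: iterable of (id, row_start, row_end) tuples, one per row version.
    Returns a dict mapping id to 'INSERT', 'UPDATE' or 'DELETE'.
    """
    by_id = {}
    for obj_id, row_start, row_end in versions:
        by_id.setdefault(obj_id, []).append((row_start, row_end))

    changes = {}
    for obj_id, rows in by_id.items():
        rows.sort()  # oldest version first
        first_start = rows[0][0]
        last_start, last_end = rows[-1]
        if last_end < OPEN_ENDED:
            # The newest version is closed and has no successor: deleted.
            if last_end > since:
                changes[obj_id] = "DELETE"
        elif first_start > since:
            # The object did not exist before `since`.
            changes[obj_id] = "INSERT"
        elif last_start > since:
            # The object existed before and got a new version.
            changes[obj_id] = "UPDATE"
    return changes
```

Note that the DELETE check needs every version of an object, which is why it is the expensive part. Objects that were inserted and then updated within the window are reported as INSERT, since only the net effect is kept.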
Edit: another point is that the history data is kept in the same table as the "real" data. This may make an INSERT faster than known alternatives such as trigger-based tracking, but if the table changes frequently and you want to process or analyse the CDC data, this can cause performance problems.

MariaDB has supported system-versioned tables since version 10.3.4. System-versioned tables are specified in the SQL:2011 standard. They can be used to capture previous versions of rows automatically. Those versions can then be queried to retrieve the values as they were at a specific point in time.
The following text and code examples are from the official MariaDB documentation:
With system-versioned tables, MariaDB Server tracks the points in time
when rows change. When you update a row on these tables, it creates a
new row to display as current without removing the old data. This
tracking remains transparent to the application. When querying a
system-versioned table, you can retrieve either the most current
values for every row or the historic values available at a given point
in time.
You may find this feature useful in efficiently tracking the time of
changes to continuously-monitored values that do not change
frequently, such as changes in temperature over the course of a year.
System versioning is often useful for auditing.
When SYSTEM VERSIONING is added to a newly created or an already existing table (using ALTER TABLE), the table is extended with row_start and row_end timestamp columns, which make it possible to retrieve the record that was valid between the start and end timestamps.
CREATE TABLE accounts (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
amount INT
) WITH SYSTEM VERSIONING;
It is then possible to retrieve data as it was at a specific time (with SELECT * FROM accounts FOR SYSTEM_TIME AS OF '2019-06-18 11:00';), all versions within a specific time range:
SELECT * FROM accounts
FOR SYSTEM_TIME
BETWEEN (NOW() - INTERVAL 1 YEAR)
AND NOW();
or all versions at once:
SELECT * FROM accounts
FOR SYSTEM_TIME ALL;

Related

Can MySQL give each row a specified lifetime so that it is deleted automatically?

I'd like to know whether MySQL (or MariaDB) offers an expiration feature, so that a row is removed from the database automatically, without using an external scheduler program or issuing any SQL such as DELETE.
This should be defined when the table is created, so that expiry is managed from the moment a row is INSERTed.
There are many related questions here:
MySQL how to make value expire?
Remove Mysql row after specified time
MySQL give rows a lifetime
However, I couldn't find the answer. I am not interested in using WHERE or DELETE.
Is it even possible?
Yes, you can create an event for this (the event scheduler must be enabled):
CREATE EVENT lifetime ON SCHEDULE
EVERY 1 DAY STARTS CURRENT_TIMESTAMP
ON COMPLETION NOT PRESERVE
ENABLE
DO BEGIN
-- put your DELETE query here, with a WHERE clause that computes your expiry date
END

DB multiple table update while permanent access

I have a set of tables in a MySQL database that contain related data (50,000 rows in total, so low volume) and are accessed all the time (7 million accesses/day). Periodically (let's say once a day) I need to update ALL the data in all the tables (full refresh).
I'm considering 2 possibilities:
use transactions, but I'm not sure how it will work with reads/locks
use versioning: adding a version column in all tables and set all rows on the same "publication" with the same version. The next publication will have a version+1, then the lower version rows can be deleted. The current version is stored in a parameter table allowing the reading query to always pick the latest available version.
Has anybody experimented with both solutions? Or is there a different/better solution?
Thanks
Replacing an entire table
CREATE TABLE `new` LIKE `real`;
-- populate `new` with the new data (the slow part)
RENAME TABLE `real` TO old,
`new` TO `real`; -- atomic and fast
Replacing an entire database: Do the above for each table, but hold off to do the RENAMEs until all the other work is done. Then do all of them in a single RENAME TABLE statement.
No locking, no transactions, no nothing.
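The multi-table swap can be scripted. As a rough sketch, the hypothetical helper below builds the single RENAME TABLE statement. It assumes each freshly populated table is named with a `_new` suffix and each retired copy gets an `_old` suffix; both naming conventions are assumptions made for this example, not part of the answer above.

```python
def build_swap_statement(tables, new_suffix="_new", old_suffix="_old"):
    """Build one RENAME TABLE statement that swaps every table at once.

    For each live table `t` this emits `t` -> `t_old` and `t_new` -> `t`,
    so all swaps happen in a single atomic statement.
    """
    if not tables:
        raise ValueError("at least one table name is required")
    pairs = []
    for name in tables:
        pairs.append(f"`{name}` TO `{name}{old_suffix}`")
        pairs.append(f"`{name}{new_suffix}` TO `{name}`")
    return "RENAME TABLE " + ", ".join(pairs)


statement = build_swap_statement(["customers", "orders"])
print(statement)
```

The `_old` tables have to be dropped before the next refresh, because RENAME TABLE fails if a target name already exists.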

easiest way to know when a MySQL database was last accessed

I have MySQL tables that are all InnoDB.
We have so many copies of various databases spread across multiple servers (trust me we're talking hundreds here), and many of them are not being queried at all.
How can I get a list of the MAX(LastAccessDate) for example for all tables within a specific database? Esp. considering that they are InnoDB tables.
I would prefer knowing even where the "select" query was run, but would settle for "insert/update" as well, since, if a db hasn't changed in a long time, it's probably dead.
If you have a table that always gets values inserted you can add a trigger to the update/insert. Inside this trigger you can set the current timestamp in a dedicated database, including the name of the database from which the insert took place.
This way the only requirement of your database is that it supports triggers.
Alternatively, you could take a look at the following, quoted from a linked article (note that it describes SQL Server, not MySQL):
Modify date and create date for a table can be retrieved from the sys.tables catalog view. When any structural changes are made the modify date is updated. It can be queried as follows:
USE [SqlAndMe]
GO
SELECT [TableName] = name,
create_date,
modify_date
FROM sys.tables
WHERE name = 'TransactionHistoryArchive'
GO
sys.tables only shows the modify date for structural changes. If we need to check when a table was last updated or accessed, we can use the dynamic management view sys.dm_db_index_usage_stats. This DMV returns counts of different types of index operations and the last time each operation was performed.
It can be used as follows:
USE [SqlAndMe]
GO
SELECT [TableName] = OBJECT_NAME(object_id),
last_user_update, last_user_seek, last_user_scan, last_user_lookup
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID('SqlAndMe')
AND OBJECT_NAME(object_id) = 'TransactionHistoryArchive'
GO
last_user_update – provides time of last user update
last_user_* – provides time of last scan/seek/lookup
It is important to note that sys.dm_db_index_usage_stats counters are reset when SQL Server service is restarted.
Hope This Helps!

Detecting database change

I have a database intensive application that needs to run every couple hours. Is there a way to detect whether a given table has changed since the last time this application ran?
The most efficient way to detect changes is this.
CHECKSUM TABLE tableName
A couple of questions:
Which OS are you working on?
Which storage engine are you using?
The command SHOW TABLE STATUS (http://dev.mysql.com/doc/refman/5.5/en/show-table-status.html) can display some info, depending on the storage engine, though.
It also depends on how large is the interval between runs of your intensive operation.
The most precise way, I believe, is to use triggers (AFTER INSERT/UPDATE), as @Neuticle mentioned, and just store the CURRENT_TIMESTAMP next to the table name.
CREATE TABLE table_versions(
table_name VARCHAR(50) NOT NULL PRIMARY KEY,
version TIMESTAMP NOT NULL
);
CREATE TRIGGER table_1_version_insert AFTER INSERT
ON table_1
FOR EACH ROW
BEGIN
REPLACE INTO table_versions VALUES('table_1', CURRENT_TIMESTAMP);
END
Could you set a trigger on the tables you want to track to add to a log table on insert? If that would work you only have to read the log tables on each run.
Use timestamp. Depending upon your needs you can set it to update on new rows, or just changes to existing rows. Go here to see a reference:
http://dev.mysql.com/doc/refman/5.0/en/timestamp-initialization.html
A common way to detect changes to a table between runs is with a query like this:
SELECT COUNT(*),MAX(t) FROM table;
But for this to work, a few assumptions must be true about your table:
The t column has a default value of NOW()
There is a trigger that runs on UPDATE and always sets the t column to NOW().
Any normal changes made to the table will then cause the output of the above query to change.
There are a few race conditions that can make this sort of check not work in some instances.
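A minimal sketch of the COUNT(*)/MAX(t) check, using Python's built-in sqlite3 module purely as a stand-in for MySQL so that it runs without a server. The table `items`, its columns, and the helper `table_fingerprint` are hypothetical, and the timestamps are set by hand instead of by a default value and a trigger, to keep the example deterministic.

```python
import sqlite3


def table_fingerprint(conn, table):
    """Return (row count, latest change time) for a table with a change-time column t."""
    # The table name is interpolated directly, so it must come from trusted code.
    return conn.execute(f"SELECT COUNT(*), MAX(t) FROM {table}").fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, t TEXT NOT NULL)")
conn.execute("INSERT INTO items VALUES (1, 'a', '2024-01-01 10:00:00')")

# Fingerprint stored at the end of the previous run.
before = table_fingerprint(conn, "items")

# Something modifies the table between runs and bumps t.
conn.execute("UPDATE items SET name = 'b', t = '2024-01-01 11:00:00' WHERE id = 1")

# Fingerprint taken at the start of the next run.
after = table_fingerprint(conn, "items")
changed = before != after
print(changed)
```

One of the race conditions is visible here: with second-resolution timestamps, an update stamped in the same second as the previous MAX(t) leaves both values unchanged and goes unnoticed.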
I have used CHECKSUM TABLE tablename and it works just splendidly.
I am calling it from an AJAX request to check for table updates. If changes are found, a screen refresh is performed.
For database "myMVC" and table "detail" it returns one row with fields "table" and "Checksum" set to "mymvc.detail" and "521719307" respectively.

SERIAL-like INT column

I have an app where depending on the type of transaction being added or updated, the ticket number may or may not increment. I can't use a SERIAL datatype for ticket number because it would increment regardless of the transaction type, so I defined ticket number as an INT. So in a multi-user environment if user A is adding or updating a transaction and user B is also doing the same, I test for tran type and if next ticket number is required, then
LET ticket = (SELECT MAX(ticket) [WITH ADDLOCK or UPDLOCK?] FROM transactions) + 1
However this has to be done exactly when the row is being committed or troubles will begin. Can you think of a better way of doing this with: Informix, Oracle, MySQL, SQL-Server, 4Js/Genero or other RDBMS? This is one main factor which will determine what RDBMS I'm going to re-write my app in.
With the Informix DBMS, the SERIAL column will not change after it is inserted; indeed, you cannot update a SERIAL value at all. You can insert a new one with either 0 as the value - in which case a new value is generated - or you can insert some other value. If the other value already exists and there is a unique constraint, that will fail; if it does not exist, or if there is no unique constraint on the serial column, then it will succeed. If the value inserted is larger than the largest value previously inserted, then the next number to be inserted will be one larger again. If the number inserted is smaller, or negative, then there is no effect on the next number.
So, you could do your update without changing the value - no problem. If you need to change the number, you will have to do a delete and insert (or insert and delete), where the insert has a zero in it. If you prefer consistency and you use transactions, you could always delete, and then (re)insert the row with the same number or with a zero to trigger a new number. This assumes you have a programming language running the SQL; I don't think you can tweak ISQL and Perform to do that automatically.
So, at this point, I don't see the problem on Informix.
With the appropriate version of IDS (anything that is supported), you can use SEQUENCE to control the values inserted too. This is based on the Oracle syntax and concept; DB2 also supports this. Other DBMS have other equivalent (but different) mechanisms for handling the auto-generated numbers.
That's what sequences were created for, and they are supported by most databases (MySQL being the only one that does not have sequences; I'm not 100% sure about Informix, though).
Any algorithm that relies on the SELECT MAX(id) anti-pattern is either dead slow in a multi-user environment or will simply not work correctly in a multi-user environment.
If you need to support MySQL as well, I'd recommend using the "native" "auto increment" type in each database (serial for PostgreSQL, auto_increment for MySQL, identity for SQL Server, sequence + trigger in Oracle and so on) and letting the driver return the generated ID value.
In JDBC there is a getGeneratedKeys() method and I'm sure other interfaces have something similar.
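Where no native sequence is available, one common workaround is to emulate it with a one-row counter table that is incremented inside a transaction, and to call it only for transaction types that need a new ticket. The sketch below is an illustration under stated assumptions, not something proposed in this thread: it uses Python's built-in sqlite3 module as a stand-in database, and the table `ticket_counter` and the function `next_ticket` are hypothetical names.

```python
import sqlite3


def next_ticket(conn):
    """Allocate the next ticket number from a one-row counter table."""
    # Take the write lock before reading, so two callers cannot get the same value.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE ticket_counter SET value = value + 1")
        (value,) = conn.execute("SELECT value FROM ticket_counter").fetchone()
        conn.execute("COMMIT")
        return value
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None disables implicit transactions; they are managed explicitly above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ticket_counter (value INTEGER NOT NULL)")
conn.execute("INSERT INTO ticket_counter VALUES (0)")

first = next_ticket(conn)
second = next_ticket(conn)
print(first, second)
```

Unlike SELECT MAX(ticket) + 1, the increment and the read happen under the same lock, so the number does not have to be computed at the exact moment the transaction row is committed.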
From your tags it's hard to tell what database you are using.
For SQL Server (since it's listed) I suggest
ticket_num = (SELECT MAX(ticket_number) FROM transactions with (updlock)) + 1