Track database table changes - mysql

I'm trying to implement a way to track changes to a table named user and another named report_to. Below are their definitions:
CREATE TABLE `user`
(
`agent_eid` int(11) NOT NULL,
`agent_id` int(11) DEFAULT NULL,
`agent_pipkin_id` int(11) DEFAULT NULL,
`first_name` varchar(45) NOT NULL,
`last_name` varchar(45) NOT NULL,
`team_id` int(11) NOT NULL,
`hire_date` date NOT NULL,
`active` bit(1) NOT NULL,
`agent_id_req` bit(1) NOT NULL,
`agent_eid_req` bit(1) NOT NULL,
`agent_pipkin_req` bit(1) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
UNIQUE KEY `agent_id_UNIQUE` (`agent_id`),
UNIQUE KEY `agent_pipkin_id_UNIQUE` (`agent_pipkin_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `report_to`
(
`agent_eid` int(11) NOT NULL,
`report_to_eid` int(11) NOT NULL,
PRIMARY KEY (`agent_eid`),
UNIQUE KEY `agent_eid_UNIQUE` (`agent_eid`),
KEY `report_to_report_fk_idx` (`report_to_eid`),
CONSTRAINT `report_to_agent_fk` FOREIGN KEY (`agent_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `report_to_report_fk` FOREIGN KEY (`report_to_eid`) REFERENCES `user` (`agent_eid`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB DEFAULT CHARSET=utf8
What can change that needs to be tracked is user.team_id, user.active and report_to.report_to_eid. What I currently have implemented is a table that is populated via an update trigger on user that tracks team changes. That table is defined as:
CREATE TABLE `user_team_changes`
(
`agent_id` int(11) NOT NULL,
`date_changed` date NOT NULL,
`old_team_id` int(11) NOT NULL,
`begin_date` date NOT NULL,
PRIMARY KEY (`agent_id`,`date_changed`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
This works fine for just tracking team changes. I'm able to use joins and a union to populate a history view that tracks that change over time for individual users. The complexity rises when I try to implement tracking for the other two change types.
I have thought about creating additional tables similar to the one tracking changes for teams, but I worry about performance hits due to the joins that will be required.
Another way I have considered is creating a table similar to a view that I have that details the current user state (it joins all the necessary user data together from 4 tables), then inserting a record on update, with a valid-until date field added. My concern with that is the amount of space this could take.
We will be using the user change history quite a bit as we will be running YTD, MTD, PMTD and time interval reports with it on an almost daily basis.
Out of the two options I am considering, which would be the best for my given situation?

The options you've presented:
using triggers to populate transaction-log tables.
including a new table with effective-date columns in the schema and tracking changes by inserting new rows.
Either one of these will work. You can add logging triggers to other tables without causing any trouble.
What distinguishes these two choices? The first one is straightforward, once you get your triggers debugged.
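For example, a trigger on report_to could populate its own log table along the same lines as your existing team-change table (the table and trigger names here are just illustrative, not tested against your schema):
CREATE TABLE `user_report_to_changes`
(
`agent_eid` int(11) NOT NULL,
`date_changed` date NOT NULL,
`old_report_to_eid` int(11) NOT NULL,
PRIMARY KEY (`agent_eid`,`date_changed`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$
CREATE TRIGGER `report_to_au` AFTER UPDATE ON `report_to`
FOR EACH ROW
BEGIN
  -- only log rows where the supervisor actually changed
  IF NEW.report_to_eid <> OLD.report_to_eid THEN
    INSERT INTO user_report_to_changes (agent_eid, date_changed, old_report_to_eid)
    VALUES (OLD.agent_eid, CURDATE(), OLD.report_to_eid);
  END IF;
END$$
DELIMITER ;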
The second choice seems to me like it will create denormalized, redundant data. That is never good. I would opt not to do that. It is possible, with judicious combinations of views and effective-date columns, to create history tables that are viewable as the present state of the system. To learn about this, look at Prof. R. T. Snodgrass's excellent book on developing time-oriented applications: http://www.cs.arizona.edu/~rts/publications.html. If you have time to do an excellent engineering (over-engineering?) job on this project, you might consider this approach.
The data volume you've mentioned will not cause intractable performance problems on any modern server hardware platform. If you do get slowdowns on JOIN operations, it's almost certain that the addition of appropriate indexes will completely fix them, as long as you declare all your DATE, DATETIME, and TIMESTAMP fields NOT NULL. (NULL values can mess up indexing and searching).
Hope this helps.

Related

Multi-Table MariaDB Database Partitioning Solution

Looking for some guidance on how to best tackle partitioning on some database tables for the purpose of archiving/deleting data over a certain age. The main reason for this is to resolve some issues in database size.
You can think of the data as akin to telemetry data in that it grows over time, but once it enters the database it doesn't change outside of the first 10-15 minutes, in the event there is any form of conflicting data that requires the application to update a recent record (max 15 mins).
Current database size is approaching 500GB and is sitting on NVMe storage across a 3x node Galera cluster in three cities. Backups are becoming increasingly large, and if an SST is needed between nodes this can take a couple of hours to complete, which is less than ideal.
The plan to deal with this is by way of archiving, where we plan to off-board historical data to another server (say once a year) with slower storage that can then be backed up once and won't change for 12 months. The historical data will be rarely accessed, and in the event it is, our application will handle querying the archive server if older than a certain date instead of the production servers that are relied on heavily for "recent" data.
We have 3x tables per customer, and they reference each other in a sort of hierarchy. There are no foreign keys in the tables, but they do hold references to one another and are used in JOIN queries. E.g. the summary table is the top of the hierarchy and holds one record per "event". Under this is the details table, and there could be 1-10 detail records sitting under the summary event. Under details is the digits table, which could include 0-10 records per detail record.
CREATE TABLE definitions below:
CREATE TABLE `summary_X` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_utc` datetime DEFAULT NULL,
`end_utc` datetime DEFAULT NULL,
`total_duration` smallint(6) DEFAULT NULL,
`legs` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `start_utc` (`start_utc`)
) ENGINE=InnoDB
CREATE TABLE `details_X` (
`xid` bigint(20) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`duration` smallint(6) DEFAULT NULL,
`start_utc` timestamp NULL DEFAULT NULL,
`end_utc` timestamp NULL DEFAULT NULL,
`event` varchar(2) DEFAULT NULL,
`event_time` smallint(6) DEFAULT NULL,
`event_a` varchar(7) DEFAULT NULL,
`event_b` varchar(7) DEFAULT NULL,
`ani` varchar(20) DEFAULT NULL,
`dnis` varchar(10) DEFAULT NULL,
`first_time` varchar(30) DEFAULT NULL,
`final_time` varchar(30) DEFAULT NULL,
`digits_count` int(2) DEFAULT 0,
`sys_a` varchar(3) DEFAULT NULL,
`sys_b` varchar(3) DEFAULT NULL,
`log_id_a` varchar(12) DEFAULT NULL,
`seq_a` varchar(1) DEFAULT NULL,
`log_id_b` varchar(12) DEFAULT NULL,
`seq_b` varchar(1) DEFAULT NULL,
`assoc_log_id_a` varchar(12) DEFAULT NULL,
`assoc_log_id_b` varchar(12) DEFAULT NULL,
PRIMARY KEY (`xid`),
KEY `start_utc` (`start_utc`),
KEY `end_utc` (`end_utc`),
KEY `event_a` (`event_a`),
KEY `event_b` (`event_b`),
KEY `id` (`id`),
KEY `final_digits` (`final_digits`),
KEY `log_id_a` (`log_id_a`),
KEY `log_id_b` (`log_id_b`)
) ENGINE=InnoDB
CREATE TABLE `digits_X` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`leg_id` bigint(20) DEFAULT NULL,
`sequence` int(2) NOT NULL,
`digits` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `digits` (`digits`),
KEY `leg_id` (`leg_id`)
) ENGINE=InnoDB
My first thought was to partition by year. That sounds easy enough, but we don't have a date column on the digits table, so records there could be orphaned away from their mapped details record and no longer match in a JOIN on the archive server.
We can also have a similar issue with summary, since the timestamps on the "details" records could span multiple years. E.g. a summary event starts at 2021-12-31 23:55:00; the first detail record has the same timestamp, and the next detail under the same event could be 2022-01-01 00:11:00. If the 2021 partition was archived off to the other server, the 2022 detail would be orphaned and no longer JOIN to the 2021 summary event.
One alternative could be not to partition at all and do SELECT/INSERT/DELETE which isn't practical with the volume of data. Some tables have 30M-40M rows per year so this would be very resource taxing. There are also 400+ customers each with their own sets of tables.
Another option I thought of was to add a "year" column to the three tables that we can partition on, holding the year of the first event across all of them so that all related records end up on the same partitions/server, but this seems like a waste of space and there should be a better way.
Any thoughts or guidance would be appreciated.
To add PARTITIONing will require copying the entire table over. That will involve downtime and disk space. If you can live with that, then...
PARTITION BY RANGE(...) where the expression involves, say, TO_DAYS(...) or possibly TO_SECONDS(...). Then set up cron jobs to add a new partition periodically. (There is nothing automated for such.) And to detach the oldest partition. See Partition for a discussion of the details. (TO_DAYS avoids the need for a 'year' column.)
Note that Partitioning is implemented as several sub-tables under a table. With "transportable tablespaces", you can detach a partition from the big table, turning it into a table unto itself. At that point, you are free to move it to another server or something.
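As a rough sketch of that flow on one of the summary tables (partition names and boundaries are illustrative, and this is untested against your data -- note that MySQL/MariaDB require the partitioning column to be part of every unique key, so the PK has to become (id, start_utc), and any NULL start_utc values have to be cleaned up first):
-- One-time: adjust the PK and repartition (this copies the whole table)
ALTER TABLE summary_X
  MODIFY start_utc DATETIME NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, start_utc);

ALTER TABLE summary_X
  PARTITION BY RANGE (TO_DAYS(start_utc)) (
    PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
    PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
  );

-- From cron, periodically: carve the next period out of pmax
ALTER TABLE summary_X REORGANIZE PARTITION pmax INTO (
  PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- To archive: swap the oldest partition into a standalone table,
-- move that table to the archive server, then drop the emptied partition
CREATE TABLE summary_X_2021 LIKE summary_X;
ALTER TABLE summary_X_2021 REMOVE PARTITIONING;
ALTER TABLE summary_X EXCHANGE PARTITION p2021 WITH TABLE summary_X_2021;
ALTER TABLE summary_X DROP PARTITION p2021;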
In a situation like yours, I might consider the following.
Write the raw data to a file (perhaps one per day) for archiving;
Insert into a table that will live only briefly; this will be purged by some means frequently;
Update "normalization" tables;
"Summarize" the data into Summary Tables, where each set of rows covers one hour (or whatever makes sense) -- see the sketch after this list;
Write "reports" from the summary table(s).
Be aware that each Partition takes an extra 5.5MB (average), so do not make many partitions. Or do you need only 2, each containing 15 minutes' data?
Meanwhile, I would look carefully at the schema. Can an INT (4 bytes) be turned into a SMALLINT (2 bytes)? Can more things be normalized?
digits_count int(2) -- that is a 4-byte INT; the (2) has no meaning and has been removed in MySQL 8. (MariaDB may follow suit someday.) It sounds like you need only a 1-byte TINYINT UNSIGNED (range: 0..255).
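For example (assuming the stored counts really do fit in 0..255):
-- note: this rebuilds the table, so schedule it accordingly
ALTER TABLE details_X MODIFY digits_count TINYINT UNSIGNED DEFAULT 0;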
Since this is log info, be aware of Daylight Saving Time wrt DATETIME. (One hour per year is missing; another hour repeats.) This problem does not occur with TIMESTAMP. Each one takes 5 bytes (unless you include fractional seconds).
(I can't advise on unnecessary indexes without seeing the queries.) SHOW TABLE STATUS will tell you how much space is being consumed by all the indexes.
Are the 3 tables of similar size?
Re "orphaning" -- You need at least 2 partitions -- one being filled (0-100% full) and an older partition (100% full)
"30M-40M rows per year" times 400 customers. Does that add up to 500 rows inserted per second? Are they INSERTed one row at a time? High speed ingestion
Are there more deletes and selects than inserts? And/or do they involve more than single rows? (I'm fishing for more info to help with some other issues you either have or are threatening to have.) Even with Deletes and no Partitioning, the disk growth will slow down as free space is generated, then reused. ("Rinse and repeat.")
Without partitioning, see Huge Deletes. But... DELETEing data from a table does not shrink its disk footprint. However, if each 'customer' has 1/400th of the data and (of course) you do each customer separately, then there may not be any disk problem.
I've given you a lot to think about. Answer some of my questions; I may have more advice.

MySQL composite index effect on joins

I have the following SQL query (DB is MySQL 5):
SELECT
    event.full_session_id,
    DATE(MIN(event.date)),
    event_exe.user_id,
    COUNT(DISTINCT event_pat.user_id)
FROM
    event AS event
    JOIN event_participant AS event_pat ON event.pat_id = event_pat.id
    JOIN event_participant AS event_exe ON event.exe_id = event_exe.id
WHERE
    event_pat.user_id <> event_exe.user_id
GROUP BY
    event.full_session_id;
"SHOW CREATE TABLE event":
CREATE TABLE `event` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`session_id` varchar(64) DEFAULT NULL,
`full_session_id` varchar(72) DEFAULT NULL,
`pat_id` int(12) DEFAULT NULL,
`exe_id` int(12) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `SESSION_IDX` (`full_session_id`),
KEY `PAT_ID_IDX` (`pat_id`),
KEY `DATE_IDX` (`date`),
KEY `SESSLOGPATEXEC_IDX` (`full_session_id`,`date`,`pat_id`,`exe_id`)
) ENGINE=MyISAM AUTO_INCREMENT=371955 DEFAULT CHARSET=utf8
"SHOW CREATE TABLE event_participant":
CREATE TABLE `event_participant` (
`id` int(12) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) NOT NULL,
`alt_user_id` varchar(64) NOT NULL,
`username` varchar(128) NOT NULL,
`usertype` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ALL_UNQ` (`user_id`,`alt_user_id`,`username`,`usertype`),
KEY `USER_ID_IDX` (`user_id`)
) ENGINE=MyISAM AUTO_INCREMENT=5397 DEFAULT CHARSET=utf8
Also, the query itself seems ugly, but this is legacy code on a production system, so we are not expected to change it (at least for now).
The problem is that there are around 36 million records in the event table (in the production system), and there have been frequent crashes of the DB machine due to "Using temporary; Using filesort" processing (they provided the EXPLAIN outputs; unfortunately, I don't have them right now. I'll try to add them to this post later.)
The customer asks for a "quick fix" by adding indices. Currently we have indices on full_session_id, pat_id, date (separately) on event and user_id on event_participant.
Thus I'm thinking of creating a composite index (pat_id, exe_id, full_session_id, date) on event. This index comprises the fields in the JOINs (equivalent to a WHERE?), then the GROUP BY, then the aggregate (MIN) parts.
This is just an idea; we currently don't have that kind of data volume to test against, so we're trying to do the best we can up front.
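Concretely, the index would be something like this (the index name is just a placeholder):
ALTER TABLE event ADD INDEX `PAT_EXE_SESS_DATE_IDX` (`pat_id`, `exe_id`, `full_session_id`, `date`);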
My question is:
Could the index above help with performance? (The effect is quite confusing to me because I have found two really contrasting results: https://dba.stackexchange.com/questions/158385/compound-index-on-inner-join-table
versus Separate Join clause in a Composite Index, where the latter suggests that a composite index on joins won't work and the former that it will.)
Does this path (adding indices) hold any hope? Or should we forget it and just try to optimize the query instead?
Thanks in advance for your help :)
Update:
I have updated the full table description for the two related tables.
MySQL version is 5.1.69. But I think we don't need to worry about the ambiguous-data issue mentioned in the comments, because it seems there won't be ambiguity in our data. Specifically, for each full_session_id, there is only one event_exe.user_id returned (it's just business logic in the application).
So, what do you think about my two questions?

How to best categorize values in a table

I'm in the process of designing a new database for a project at work. I want to create a table that stores Assignments for a digital classroom. Each Assignment can be one of 2 categories: "Individual" or "Group".
The first implementation that comes to mind is the following:
CREATE TABLE `assignments` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL,
`category` varchar(10) NOT NULL DEFAULT 'individual',
PRIMARY KEY (`id`),
KEY `category_index` (`category`(10))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I would then select all assignments of a given category with:
SELECT title FROM assignments WHERE category = "individual"
However, because we've had performance issues in the past, I'm trying to optimize the design as much as possible. As such, I'm wondering whether or not storing the category as a VARCHAR is a good idea (considering the table will get quite large). Would an indexed INT perform better than a VARCHAR?
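For comparison, the INT-backed variant I have in mind would look something like this (the lookup table and its column names are hypothetical):
CREATE TABLE `assignment_categories` (
`id` tinyint(3) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name_unique` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `assignments` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL,
`category_id` tinyint(3) unsigned NOT NULL DEFAULT 1,
PRIMARY KEY (`id`),
KEY `category_id_index` (`category_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- 1 = 'individual', 2 = 'group'
SELECT title FROM assignments WHERE category_id = 1;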
Aside from just performance, I'm also curious what would be considered a good solution from a design perspective. Suggestions?

What is the most efficient way to check against a huge MySQL table?

I have a service in which users may "like" content posted by other users. Currently, the system doesn't filter out content that the user has already liked, which is undesirable behavior. I have a table called LikeRecords which stores a userID, a contentID, and a timePlaced timestamp. The idea is to use this table to filter content that a user has already liked when choosing what to display.
The thing is, I'm a MySQL amateur, and don't understand scaling and maintenance well. Even though I only have about 1,500 users, this table already has 45,000 records. I'm worried that as my service grows to tens or hundreds of thousands of users, this table will explode into millions and become slow since the filter operation would be called very frequently.
Is there a better design pattern I could use here, or a maintenance technique I should use?
EDIT: Here is the query for building the table in question:
CREATE TABLE `likerecords` (
`likeID` int(11) NOT NULL AUTO_INCREMENT,
`userID` int(10) unsigned NOT NULL,
`orderID` int(11) NOT NULL,
`timePlaced` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`special` tinyint(1) NOT NULL,
PRIMARY KEY (`likeID`)
) ENGINE=InnoDB AUTO_INCREMENT=44775 DEFAULT CHARSET=latin1
I would be using it to filter results in other tables, such as an "orders" table.
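For context, the filtering query would be something along these lines (simplified; the orders table's column names here are assumptions):
SELECT o.*
FROM orders AS o
LEFT JOIN likerecords AS lr
  ON lr.orderID = o.orderID
 AND lr.userID = 123          -- the current user's ID
WHERE lr.likeID IS NULL;      -- keep only content the user has not liked yet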

First database with referential integrity constraints -- suggestions, feedback, errors?

TARGET_RDBMS: MySQL-5.X-InnoDB ("X" equals current stable release)
BACKGROUND: I'm building my first database with true referential integrity constraints, in an effort to get feedback. After creating the "real" DDL, I've made an abstraction that I believe covers the "feel" of the database; this is only 3 tables out of about 20, all with referential integrity constraints. The only pattern I see that is missing is a composite-key table, which does not have data to be dumped in right now anyway, so I'm just focusing on the first iteration.
Sample Data / Unit Test: One thing I do not know is how to build out a sample data set that will offer 100% coverage of the referential integrity modeled -- AND build "Unit Test" around that sample data and this DDL:
Sample DDL:
(Note: Just to be clear, the LEGEND and naming standards are JUST for this example, which I've abstracted from the "real" database. The column names are robotic in nature, and meant to make the meaning and relationship of a given instance as clear as possible. If you have suggestions on the notation system used, please feel free to comment. I'm open to any suggestions. Thanks!)
CREATE DATABASE sampleDB;
use sampleDB;
# ###############
# LEGEND
# - sID = surrogate key
# - nID = natural key
# - cID = common/shared across tables, but NOT unique/natural-key
# - PK = Primary Key
# - FK = Foreign Key
# - data01 = Sample data (non-key,not-shared-across-tables)
# - data02 = Sample data NOT NULL (non-key,not-shared-across-tables)
#
# - uID = user defined unique/natural key (NOTE: not used)
# ###############
# Behavior
# - create_timestamp (NOT NULL, updated on record creation, NOT update)
# - update_timestamp (NOT NULL, updated on record creation AND updates)
CREATE TABLE `TABLE_01` (
`TABLE_01_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_01_cID` int(8) NOT NULL,
`TABLE_01_data01` varchar(128) default NULL,
`TABLE_01_data02` varchar(128) default NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_01_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `TABLE_02` (
`TABLE_02_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_02_nID_FK__TABLE_01_sID_PK` MEDIUMINT NOT NULL,
`TABLE_02_cID` int(8) NOT NULL,
`TABLE_02_data01` varchar(128) default NULL,
`TABLE_02_data02` varchar(128) NOT NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_02_sID_PK`),
FOREIGN KEY (TABLE_02_nID_FK__TABLE_01_sID_PK) REFERENCES TABLE_01(TABLE_01_sID_PK),
INDEX `TABLE_02_nID_FK__TABLE_01_sID_PK` (`TABLE_02_nID_FK__TABLE_01_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `TABLE_03` (
`TABLE_03_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_03_nID_FK__TABLE_01_sID_PK` MEDIUMINT NOT NULL,
`TABLE_03_nID_FK__TABLE_02_sID_PK` MEDIUMINT NOT NULL,
`TABLE_03_cID` int(8) NOT NULL,
`TABLE_03_data01` varchar(128) default NULL,
`TABLE_03_data02` varchar(128) NOT NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_03_sID_PK`),
FOREIGN KEY (TABLE_03_nID_FK__TABLE_01_sID_PK) REFERENCES TABLE_01(TABLE_01_sID_PK),
FOREIGN KEY (TABLE_03_nID_FK__TABLE_02_sID_PK) REFERENCES TABLE_02(TABLE_02_sID_PK),
INDEX `TABLE_03_nID_FK__TABLE_01_sID_PK` (`TABLE_03_nID_FK__TABLE_01_sID_PK`),
INDEX `TABLE_03_nID_FK__TABLE_02_sID_PK` (`TABLE_03_nID_FK__TABLE_02_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SHOW TABLES;
# DROP DATABASE `sampleDB`;
# #######################
# View table definition
# DESC inserttablename;
# #######################
# View table create statement
# SHOW CREATE TABLE example;
Questions:
Any and all feedback on missing, wrong, or "better" ways to do this database build are welcome. If you have questions, just comment -- and I'll respond ASAP. Again, thanks~!
UPDATE (1):
Just added "MEDIUMINT NOT NULL AUTO_INCREMENT" to the PKs -- not sure how I left that off.
First of all, I want to applaud you for defining a standard. There is no end to how much it will come to help you in the future.
Having said that, a couple of very subjective opinions from my part:
I don't like to embed type information in names, such as "TABLE_PERSON" or "PERSON_T" because it becomes confusing the second you replace a table with a view instead. At this point you could of course search and replace "PERSON_T" with "PERSON_VW" instead, but it kind of misses the point :)
The same goes for columns (although I can't see this in your example). Think of the "n_is_dead" column that gets changed from numeric to varchar.
Can a row exist in a table without being created (create_timestamp)? Declare columns as NOT NULL if they really can't be null. In fact, I start off having NOT NULL on most of my columns because it makes me think harder about the nature of the data.
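In TABLE_01, for example, that could become something like this (assuming MySQL 5.6.5 or later, where more than one column may have a CURRENT_TIMESTAMP default):
`create_timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,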
I'm a fan of naming the primary key column something other than ID. For example
company(company_id, etc)
person(person_id, company_id, firstname etc)
I've heard some people have problems with O/R mappers that want you to have the primary key named "ID" at all times, but I don't know if this is still true or if this has changed recently.
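In DDL terms, just to illustrate the naming (types and columns are abbreviated placeholders):
CREATE TABLE company (
  company_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(128) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE person (
  person_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  company_id INT NOT NULL,
  firstname  VARCHAR(128) NOT NULL,
  FOREIGN KEY (company_id) REFERENCES company (company_id)
) ENGINE=InnoDB;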
It's not clear to me if you intended to embed (s, n, c) in the column names to indicate whether they are surrogate, natural, or common keys. But I also don't think this is a good idea. I feel that would "reveal" some implementation detail that doesn't fit naturally in the logical model.
It looks like you are exposing/embedding the foreign key relationship in the column names. I had never thought of this, but I think you will deeply regret this one, if only because it makes the column names unbearably ugly :)
When choosing a name for an index: the only time I regret naming an index something is when I look at an execution plan and see "index_01" being used. I always wish I had put the column name in the index name to make it visible in the plan. I don't know the length limit for an index name, but I always run into the limit on Oracle. So try to come up with some rule for how to abbreviate the table name; the column name is the important thing here.
Regarding mixed case: I always (no exceptions) go with either ALL_UPPER_CASE or all_lower_case. The reason is that in the past I've been burned when migrating queries between databases that treat case differently. Lately, I use all_lower_case because the typical font of our editors makes it easier to spot spelling errors in lower case than in upper case. And when I fail at things, it doesn't seem like the editor is SHOUTING AT ME ;)