Best way to snapshot MySQL data for an audit trail

I'm looking for advice on serializing information for an audit trail in a MySQL database.
I store events that have multiple relations, six to be exact, in an Events table, so each record has six foreign keys. I'm wondering what the most scalable approach is for serializing the related entities and storing them in the same Event record, because the data should persist even if the underlying records are later deleted or changed.
The API is TypeScript and we talk to the database through TypeORM. My initial approach was to add a @BeforeInsert hook that loads all the related entities (some may or may not be present) and stores them as JSON, either in a json column, a text column, or via some sort of blob conversion.
This will work for the foreseeable future, but I'm wondering what the most scalable approach would be. Thank you!
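Edit: here's roughly what I had in mind, with placeholder entity and column names (only two of the six relations shown):
import {
  Entity,
  PrimaryGeneratedColumn,
  ManyToOne,
  Column,
  BeforeInsert,
} from "typeorm";
// Hypothetical related entities, standing in for two of the six real relations.
import { Customer } from "./Customer";
import { Location } from "./Location";

@Entity()
export class Event {
  @PrimaryGeneratedColumn()
  id: number;

  @ManyToOne(() => Customer, { eager: true })
  customer?: Customer;

  @ManyToOne(() => Location, { eager: true })
  location?: Location;

  // ...four more relations...

  // Frozen copy of the related entities at the time the event was recorded.
  @Column({ type: "json" })
  snapshot: Record<string, unknown>;

  @BeforeInsert()
  captureSnapshot() {
    // The relations must already be loaded on this instance; an entity
    // listener cannot query the database by itself.
    this.snapshot = {
      customer: this.customer ?? null,
      location: this.location ?? null,
    };
  }
}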

I don't know what the best solution is; it will probably depend on your features and entities.
Here are some other options you can use:
Concerning deleted records: use softRemove / softDelete. Records are never actually deleted from the database; they are only marked as deleted for TypeORM via the @DeleteDateColumn decorator.
Check the following pairs of TypeORM methods: softRemove / softDelete and recover / restore.
Concerning updated records: a workaround for the JSON stringify approach is to never update an entity in place, but always make a copy and then update that copy (clearly not the best solution). TypeORM also has the @VersionColumn decorator to persist a version number on the entity. You can change your code to fetch the older version of that entity.
Depending on your entities, the table can grow very fast!
With these two options your data is always persisted.
The advantage is that this is resilient to model migrations, since all your data is stored as data rather than as stringified JSON text in a column.
The disadvantage is obviously the size of the database, plus the maintenance and development effort.
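For example, a minimal entity carrying both decorators might look like this (the entity and column names are just placeholders):
import {
  Entity,
  PrimaryGeneratedColumn,
  Column,
  DeleteDateColumn,
  VersionColumn,
} from "typeorm";

// Placeholder entity; the same two decorators apply to any of your entities.
@Entity()
export class Customer {
  @PrimaryGeneratedColumn()
  id: number;

  @Column()
  name: string;

  // Set by softRemove()/softDelete(); the row stays in the table and is
  // simply filtered out of normal queries until recover()/restore() is called.
  @DeleteDateColumn()
  deletedAt?: Date;

  // Incremented by TypeORM on every save(); it tells you that a row changed,
  // but does not by itself keep the old values.
  @VersionColumn()
  version: number;
}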
Otherwise you can use a subscriber that handles all modifications on the Event table. In this subscriber you can insert your modifications into a table named EventAudit (or EventLogWhatever), as something like "Name was updated from X to Y", or do something more complex such as another database storing every version. In fact, you can do whatever you like inside this global subscriber, since you have access to the data source, the entity, and the manager.
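A rough sketch of such a subscriber (the EventAudit columns here are assumptions):
import {
  EntitySubscriberInterface,
  EventSubscriber,
  InsertEvent,
  UpdateEvent,
} from "typeorm";
// Hypothetical entities for the sketch.
import { Event } from "./entities/Event";
import { EventAudit } from "./entities/EventAudit";

// Register this class in the DataSource's `subscribers` option.
@EventSubscriber()
export class EventAuditSubscriber implements EntitySubscriberInterface<Event> {
  // Only listen to changes on the Event entity.
  listenTo() {
    return Event;
  }

  async afterInsert(event: InsertEvent<Event>) {
    // event.manager runs in the same transaction as the insert.
    await event.manager.getRepository(EventAudit).save({
      eventId: event.entity.id,
      action: "INSERT",
      snapshot: JSON.stringify(event.entity),
    });
  }

  async afterUpdate(event: UpdateEvent<Event>) {
    await event.manager.getRepository(EventAudit).save({
      eventId: event.databaseEntity?.id,
      action: "UPDATE",
      changedColumns: event.updatedColumns.map((c) => c.propertyName).join(","),
      snapshot: JSON.stringify(event.entity),
    });
  }
}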
I hope this helps.

Related

Best practices for transforming a relational database to a non-relational one?

I have a MySQL Database and I need to create a Mongo Database (I don't care about keeping any data).
So are there any good practices for designing the structure (mongoose.Schema) based on the relational tables of MySQL?
For example, the SQL database has a users table and a courses table with a 1:n relation. Should I also create two collections in MongoDB, or would it be better to create a new field courses: [] inside the user document and create only the users collection?
The schema definition should be driven by the use cases of the application.
Under which conditions is the data accessed and modified? Which is the leading entity?
E.g. when a user is loaded, do you always also want to know the user's courses? That would be an argument for embedding.
Can you update a course without knowing all of its users, e.g. update the name of a course? Do you want to list an overview of all courses? That would be an argument for extracting courses into their own collection.
So there is no general guideline for such a migration, because the use cases cannot be derived from the schema definition alone.
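To make the two options concrete, here is a rough sketch with assumed field names (mongoose with TypeScript):
import mongoose, { Schema } from "mongoose";

// Option 1: embed courses inside the user document.
// Good when courses are almost always loaded together with their user.
const embeddedUserSchema = new Schema({
  name: String,
  courses: [{ title: String, startedAt: Date }],
});

// Option 2: keep courses in their own collection and reference them.
// Good when courses are listed or updated independently of any user.
const courseSchema = new Schema({ title: String, startedAt: Date });
const userSchema = new Schema({
  name: String,
  courses: [{ type: Schema.Types.ObjectId, ref: "Course" }],
});

export const EmbeddedUser = mongoose.model("EmbeddedUser", embeddedUserSchema);
export const Course = mongoose.model("Course", courseSchema);
export const User = mongoose.model("User", userSchema);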
If you don't care about the data, the best approach is to redesign it from scratch.
NoSQL databases differ from an RDBMS in many ways, so a direct mapping will hardly be efficient and in many cases is not possible at all.
The first thing you need to answer for yourself (and probably mention in the question) is why you need to change databases in the first place. There are different kinds of problems that Mongo can solve better than SQL, and they require different data models. None of them come for free, so you will need to understand the trade-offs.
You can start from a very simple rule: in SQL you model your data after your business objects and describe the relations between them; in Mongo you model your data after the queries you need to answer. As soon as you grasp that idea, you will be able to ask answerable questions.
It may be worth reading https://www.mongodb.com/blog/post/building-with-patterns-a-summary as a starting point.
An old yet still quite useful read is https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1. Just keep in mind it was written a long time ago, when Mongo did not have many of the v4+ features. Nevertheless, it describes the philosophy of Mongo data modelling with simple examples, and that hasn't changed much since then.

Domain event storage in SQL... use JSON serialization?

I'm looking at refactoring an existing code base before we let it loose in the wild with our first customer. I really don't like the current domain event storage structure, and I'm trying to come up with a good way to store the many very different events in relational tables.
Architecture:
Web app
Java
Spring 4
Spring JDBC
MSSQL
Details:
Roughly 40 different events, each related to either the aggregate root or one of its child elements.
Details of the event are stored, but not the object state (so no CQRS event sourcing)
Events are ONLY used for reports
Reports are fed with Java objects, i.e. the reports do NOT run directly off SQL.
Currently the events are stored in a single event table per bounded context. To hold all the different event types and data, the table schema looks like this:
event_id long,
event_type varchar,
event_time datetime,
context_1_key varchar,
context_1_val varchar,
context_2_key varchar,
context_2_val varchar,
context_3_key varchar,
...repeat like 10x...
So, for example (order = aggregate root, item = child of order):
event_type=ITEM_OPEN
context_1_key=ORDER_ID
context_1_value=1000
context_2_key=ITEM_ID
context_2_value=2000
No, I don't like it, and no, I was not responsible for it.
Issues:
context_1_xxx fields are fragile and difficult to maintain/troubleshoot/expand
Everything stuffed into one table will be a performance problem (even though reporting is not performance-sensitive)
Events are linked to the domain object; they don't store the state of the object, e.g. the recorded event is useless if the item is deleted
My gut tells me creating 40 different tables, each with a schema unique to its event, is not the answer. Instead I was thinking of serializing (to JSON) a snapshot of the domain object(s) to be saved along with the event data.
It seems a convenient solution:
we already use a Spring/Jackson module to serialize objects for the browser-based clients.
the team is pretty comfortable with the serialize/deserialize process so there is no major learning curve.
the event data must go back through the application to generate reports, which will be easy enough by deserializing with Jackson
The only real downsides I can see are:
- unable to use SQL-based third-party reporting tools
- unable to index the tables on the properties of the stored object (JSON)
I can somewhat mitigate issue #2 by further breaking down the event storage into a few different tables.
What else am I missing? Is there an established best-approach to accomplishing this? How do you do it?
Start with Building an Event Storage, by Greg Young.
Konrad Garus describes an event store using PostgreSQL.
My gut tells me creating 40 different tables, each with a schema unique to its event, is not the answer.
Probably not. The first cut should be a single table for events of all types. You have a blob (JSON is fine) for the event data, a similar blob for the event metadata, and then a bunch of columns that you use to extract correctly ordered histories of events.
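As a rough illustration of that shape (sketched in TypeScript with assumed column names, since the exact schema depends on how you order and query histories):
// Rough shape of a single generic event row: the ordering/lookup columns
// stay as real columns, everything event-specific lives in the two blobs.
// All names below are assumptions, not part of the original schema.
interface StoredEvent {
  eventId: number;     // monotonically increasing position in the stream
  streamId: string;    // e.g. the aggregate root id, such as "ORDER-1000"
  eventType: string;   // e.g. "ITEM_OPEN"
  occurredAt: Date;
  data: string;        // JSON blob: the event payload itself
  metadata: string;    // JSON blob: user, correlation id, schema version, ...
}

// Writing an event then means serializing the payload once:
const itemOpened: StoredEvent = {
  eventId: 1,
  streamId: "ORDER-1000",
  eventType: "ITEM_OPEN",
  occurredAt: new Date(),
  data: JSON.stringify({ orderId: 1000, itemId: 2000 }),
  metadata: JSON.stringify({ user: "system", schemaVersion: 1 }),
};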
Instead I was thinking of serializing (to JSON) a snapshot of the domain object(s) to be saved along with the event data.
That's an odd phrase. JSON representation of the event data? That makes sense.
"Snapshot" is an eyebrow raiser, though -- you don't need a snapshot of the event data, because the event is immutable. You definitely don't want to be mixing state snapshots (ie, the result of rolling up the history of events) with the events themselves.
Followup: looking at how a GetEventStore .NET client writes and reads event histories might give you additional ideas for how to design your schema. Notice that the event data/metadata are being handled as blobs.

Configuration Data - JSON stored in Table versus individual fields

I have a table A that contains the definition/configuration for a form (fields, display information, etc.). I perform a lookup into that table to determine what the form being displayed looks like. We also dynamically create tables to hold the data specified by that form or record.
When working with other developers, it has twice been suggested to store the field information as JSON in a single column in table A instead of in individual configuration columns.
My principal concern is performance. Either we retrieve row information from table A, or we retrieve row information from table A and parse it in the client.
Which is better in terms of performance? In terms of code reuse?
The short answer is yes: storing configuration as a serialized JSON document will give you the flexibility of making and propagating changes easily, likely with less code. Ideally, let the client do the deserialization.
Assuming the documents are fairly small (<5 KB), the processing cost is negligible, and as long as your access pattern is key/value based, the database performance should be no different from accessing any other row by primary key. Make sure to index the key.
But more broadly, I would consider the following:
A document store for this scenario (for both the configuration and the data).
Consider separating the schema definition from the user/system preferences.
Shard data by the key (this would be a replacement for creating separate tables).
My principal concern is performance. Either we retrieve row information from table A, or we retrieve row information from table A and parse it in the client.
Which is better in terms of performance? In terms of code reuse?
I do not see performance as a problem here.
JSON Pros
Schema flexibility. If you change or add something, you don't need to touch the database tables.
Configuration richness. JSON is more expressive than a database table.
Easy nested structure support
JSON Cons
Inability to change only part of a JSON object. You have to deserialize it, change it, serialize it again and then store it.
Inability to easily change a part of many objects. Where a simple UPDATE ... WHERE can be issued against a database table, with JSON you will have to read the database row by row and update each object separately.
Weak versioning. Changing the JSON schema format is not simple or obvious. When you change a database structure, it's always a visible and straightforward process; a JSON schema change is not.
If you go with JSON, I recommend using JSON Schema to validate the current versions of your data, and consider establishing a migration policy: if the JSON schema changes, a special migration must be prepared that walks the database and restructures all JSON data there in a single transaction.
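As a sketch of that kind of validation, assuming the Ajv library and a hypothetical form-configuration schema:
import Ajv from "ajv";

// Hypothetical schema for a stored form configuration, version 2.
const formConfigSchema = {
  type: "object",
  required: ["schemaVersion", "fields"],
  properties: {
    schemaVersion: { const: 2 },
    fields: {
      type: "array",
      items: {
        type: "object",
        required: ["name", "type"],
        properties: {
          name: { type: "string" },
          type: { enum: ["text", "number", "date"] },
          label: { type: "string" },
        },
      },
    },
  },
};

const ajv = new Ajv();
const validate = ajv.compile(formConfigSchema);

// Validate a configuration blob read from table A before using it; rows that
// fail validation are candidates for the migration described above.
export function parseFormConfig(raw: string): unknown {
  const config = JSON.parse(raw);
  if (!validate(config)) {
    throw new Error("Invalid form configuration: " + ajv.errorsText(validate.errors));
  }
  return config;
}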

Is there a MySQL "multi-table" trigger, e.g. a trigger that fires only after ALL tables are done updating?

Several tables have a trigger that generates a JSON object representation of the row on update/insert and stores it in a JSON column, e.g. {"email": ..., "relations": N} in an email.json column.
relations is simply a numeric tie-together (let me know if there is a word for it) that allows me to tie multiple names, emails, phones, and homes together into one object,
e.g. the touchRelation.json column:
{
  "emails": [ {"email": "1#a.com"}, {"email": "2#a.com"}, {"email": "N#a.com"} ],
  "teles":  [ {"tele": "..."}, {"tele": "..."}, {"tele": "..."} ],
  "names":  [ {"name": "..."}, {"name": "..."}, {"name": "..."} ],
  "homes":  [ {"home": "..."}, {"home": "..."}, {"home": "..."} ]
}
The problem I'm having is that 1) it would be wasteful and inefficient to update touchRelations.json EVERY TIME one of the other tables has data created, updated, or deleted, especially if several tables are updated at once, and
2) I may not be able to rely on the developer to call update_Relations_json() after each query.
Is there a simple way to tell if one or more of the tables have been updated, and ONLY regenerate relations.json after all updates on all tables have finished?
One possible solution would be to create a "pending updates" table that stores the information in a queue, inserts/updates the data from the queue table into the storage table one by one, then calls the update function, but I'm sure this isn't the best option.
Another option would be to create a JSON parser in the DB that reads the complete JSON relation (the big one above), updates the tables, then rebuilds the JSON object, but that seems like a poor use of the database.
The BEST option I can think of would be to create an "updates" metadata column with a default of 0. When we update a phone, email, name, or home, the metadata column is set to 1 (representing an update that has not yet been reflected in the relations JSON column).
Next, create a stored procedure request_relations_json() that checks for pending commits (a 1 in the "updates" column). If there are no pending updates, return the current relations.json column to the application. If there are, regenerate the JSON and then return it.
It's hackish, but it avoids regenerating the JSON on every update. I still hope there's a more elegant solution out there.
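At the application level (TypeScript + mysql2; table, column, and helper names are assumptions), that dirty-flag idea would look roughly like this:
import { Connection, RowDataPacket } from "mysql2/promise";

// Only rebuild relations.json when the "updates" flag says the cached copy
// is stale; otherwise return the stored JSON as-is.
export async function requestRelationsJson(
  conn: Connection,
  relationId: number,
  // Rebuilds the relations object from the emails/teles/names/homes tables;
  // passed in here because its queries depend on your actual schema.
  rebuild: (conn: Connection, relationId: number) => Promise<object>
): Promise<string> {
  const [rows] = await conn.query<RowDataPacket[]>(
    "SELECT json, updates FROM touchRelations WHERE relation_id = ?",
    [relationId]
  );
  const row = rows[0];

  if (row && row.updates === 0) {
    return row.json; // nothing changed since the last build: cached copy is valid
  }

  // Something changed: rebuild, store the fresh JSON, and clear the flag.
  const fresh = JSON.stringify(await rebuild(conn, relationId));
  await conn.query(
    "UPDATE touchRelations SET json = ?, updates = 0 WHERE relation_id = ?",
    [fresh, relationId]
  );
  return fresh;
}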
Not sure if this is an answer or a comment, but...
MySQL doesn't support transaction triggers. This means that you can only fire triggers when data is changed in a table. In your example, I'm guessing there is no pre-defined order or combination for the data changes - in one case, you might be creating a brand new record, complete with email, address, name and phone; in another case, you might be adding a new phone number to an existing record.
Having a "trigger per table" is the only way in which you can achieve what you want without resorting to exotic solutions (like mimicking a materialized view).
However, as you are already storing your data 3 times (once in the "normal" columns, once in the table JSON, once in the relations JSON), is efficiency really such a big deal? Do you know this is a problem?
A bigger concern for me is that I dislike triggers on principle - they're hard to test, harder to debug, and harder still to maintain in an evolving database.
Triggers that act on data or tables outside the current row make me very nervous - testing the different permutations of insert/update/delete on your 4 tables would be extremely hard - what if the trigger on "touchEmail" has a bug that overwrites the data managed by the trigger on "touchHome"? You may also face deadlocks etc. (not sure if that's a realistic concern on MySQL).
Have you considered using a different cache for your JSON? There are a number of options. MySQL has a query cache; if you can rely on this, you would create the JSON on the fly, and cache the queries. This has the huge benefit of dealing with cache invalidation automagically - as the underlying data changes, MySQL purges the relevant items in the cache. On the downside - tuning this cache is tricky.
The next option is whatever your programming language/framework gives you. Most modern frameworks include a solution for caching, but you almost certainly end up invalidating the cache in code; this can be a complex solution, but puts the responsibility where it belongs (the application developers).
If your solution has to scale to exotic levels, you can use a dedicated cache - memcache is available for most environments and languages. It works, scales, is robust - but also introduces significant additional complexity.
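For the application-level option, a deliberately naive sketch (names are placeholders); as noted above, the hard part is calling the invalidation from every code path that writes to the underlying tables:
// A simple in-process cache for the generated JSON.
const relationsCache = new Map<number, string>();

export async function getRelationsJson(
  relationId: number,
  build: (id: number) => Promise<string> // rebuilds the JSON from the source tables
): Promise<string> {
  const cached = relationsCache.get(relationId);
  if (cached !== undefined) return cached;
  const json = await build(relationId);
  relationsCache.set(relationId, json);
  return json;
}

// Must be called by every write path that touches emails/teles/names/homes.
export function invalidateRelations(relationId: number): void {
  relationsCache.delete(relationId);
}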

MySQL key/value store problem

I'm trying to implement a key/value store with MySQL.
I have a user table that has 2 columns, one for the global ID and one for the serialized data.
Now the problem is that every time any bit of the user's data changes, I have to retrieve the serialized data from the DB, alter it, re-serialize it, and write it back. I have to repeat these steps even for a very, very small change to any of the user's data (since there's no way to update just part of that cell within the DB itself).
Basically, I'm wondering what solutions people normally use when faced with this problem.
Maybe you should preprocess your JSON data and insert it as a proper MySQL row, separated into fields.
Since your input is JSON, you have various alternatives for converting the data:
You mentioned many small changes happen in your case. Where do they occur? Do they happen in a member of a list? A top-level attribute?
If updates occur mainly in list members within a part of your JSON data, then perhaps every member should in fact be represented as a separate row in a different table.
If updates occur in an attribute, then represent it as a field.
I think the cost of preprocessing won't hurt in your case.
When this is a problem, people do not use key/value stores; they design a normalized relational database schema that stores the data in separate, single-valued columns which can be updated individually.
To be honest, your solution is using a database as a glorified file system - I would not recommend this approach for application data that is core to your application.
The best way to use a relational database, in my opinion, is to store relational data - tables, columns, primary and foreign keys, data types. There are situations where this doesn't work - for instance, if your data is really a document, or when the data structures aren't known in advance. For those situations, you can either extend the relational model, or migrate to a document or object database.
In your case, I'd first see whether the serialized data can be modeled as relational data, and whether you even need a database. If so, move to a relational model. If you need a database but can't model the data as a relational set, you could go for a key/value model where you extract your serialized data into individual key/value pairs; this at least means you can update or add an individual field rather than rewriting the entire document. Key/value is not a natural fit for an RDBMS, but it may be a smaller jump from your current architecture.
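A rough sketch of that extraction, assuming a hypothetical user_attributes(user_id, attr_key, attr_value) table with a unique key on (user_id, attr_key):
import { Connection } from "mysql2/promise";

// Each field of the user's data lives as its own (user_id, attr_key, attr_value)
// row, so a single field can be updated in place instead of rewriting the whole
// serialized blob. Table and column names are assumptions for illustration.
export async function setUserAttribute(
  conn: Connection,
  userId: number,
  key: string,
  value: string
): Promise<void> {
  await conn.execute(
    `INSERT INTO user_attributes (user_id, attr_key, attr_value)
     VALUES (?, ?, ?)
     ON DUPLICATE KEY UPDATE attr_value = VALUES(attr_value)`,
    [userId, key, value]
  );
}

// Example: change just the user's address without touching anything else.
// await setUserAttribute(conn, 42, "address", "221B Baker Street");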
When you have a key/value store, and assuming your serialized data is JSON, it is only really effective when you have memcached along with it: you don't update the database on the fly every time, but instead update memcached and then push the change to your database in the background. You still have to update the entire value rather than an individual field of your JSON data (like the address alone) in the database, but you can update and retrieve data quickly from memcached, and since there are no complex relations in the database it is fast to push and pull data between the database and memcached.
I would continue with what you are doing and create separate tables for the indexable data. This allows you to treat your database as a single data store which is easily managed through most operational tasks, including updates, backups, restores, clustering, etc.
The only thing you may want to consider is adding Elasticsearch to the mix if you need anything like a LIKE query, purely for improved search performance.
If space is not an issue for you, I would even make it an insert-only database, so any change adds a new record and you keep the history. Of course, you may want to remove older records, but a background job can delete the superseded records in batches. (Mind you, what I just described is basically Kafka.)
There are many alternatives out there now that beat an RDBMS in terms of performance. However, they all add extra operational overhead, in that each is yet another piece of middleware to maintain.
The way around that, if you have a microservices architecture, is to keep the middleware as part of your microservice stack. However, you still have to deal with transmitting the data across the microservices, so you'd still end up switching to Kafka underneath it all.