Domain event storage in SQL... use JSON serialization?

I'm looking at refactoring an existing code base before we let it loose in the wild with our first customer. I really don't like the current domain event storage structure, and I'm trying to come up with a good way to store the many very different events in relational tables.
Architecture:
Web app
Java
Spring 4
Spring JDBC
MSSQL
Details:
Roughly 40 different events, each related to either the aggregate root or one of its child elements.
Details of the event are stored, but not the object state (so no CQRS event sourcing)
Events are ONLY used for reports
Reports are fed with Java objects, i.e. the reports do NOT run directly off SQL.
Currently the events are stored in a single event table per bounded context. In order to hold all the different event types and data, the table schema looks like this:
event_id long,
event_type varchar,
event_time datetime,
context_1_key varchar,
context_1_val varchar,
context_2_key varchar,
context_2_val varchar,
context_3_key varchar,
...repeat like 10x...
So, for example (order = aggregate root, item = child of the order):
event_type=ITEM_OPEN
context_1_key=ORDER_ID
context_1_val=1000
context_2_key=ITEM_ID
context_2_val=2000
No, I don't like it, and no, I was not responsible for doing that.
Issues:
The context_1_xxx fields are fragile and difficult to maintain, troubleshoot, and expand.
Everything stuffed into one table will become a performance problem (even though reporting is not performance-sensitive).
Events are linked to the domain object; they don't store the state of the object, e.g. the recorded event is useless if the item is deleted.
My gut tells me creating 40 different tables with a schema unique to each event is not the answer. Instead I was thinking of serializing (JSON) a snapshot of the domain object(s) to be saved along with the event data.
It seems like a convenient solution:
we already use a Spring/Jackson module to serialize objects for the browser-based clients.
the team is pretty comfortable with the serialize/deserialize process so there is no major learning curve.
the event data must go back through the application to generate reports anyway, which will be easy enough by deserializing with Jackson
The only real downsides I can see are:
- unable to use SQL-based third-party reporting tools
- unable to index the tables on the properties of the stored object (JSON)
I can somewhat mitigate issue #2 by further breaking down the event storage into a few different tables.
What else am I missing? Is there an established best-approach to accomplishing this? How do you do it?

Start with Building an Event Storage, by Greg Young.
Konrad Garus describes an event store using PostgreSQL.
My gut tells me creating 40 different tables with a schema unique to each event is not the answer.
Probably not. The first cut should be a single table for events of all types. You have a blob (JSON is fine) for the event data, a similar blob for the event metadata, and then a bunch of columns that you use to extract correctly ordered histories of events.
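A rough sketch of that shape, using the Jackson and Spring JDBC stack you already have; the domain_event table and its column names here are only illustrative, not a prescribed schema:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.jdbc.core.JdbcTemplate;
import java.sql.Timestamp;
import java.time.Instant;

public class EventStore {
    private final JdbcTemplate jdbc;
    private final ObjectMapper mapper = new ObjectMapper();

    public EventStore(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Append one event: fixed columns for identity/ordering, JSON blobs for everything else.
    public void append(String streamId, long version, String eventType,
                       Object eventData, Object metadata) throws JsonProcessingException {
        jdbc.update(
            "INSERT INTO domain_event (stream_id, version, event_type, event_time, data, metadata) "
          + "VALUES (?, ?, ?, ?, ?, ?)",
            streamId,                              // e.g. the order id
            version,                               // position of the event within its stream
            eventType,                             // e.g. ITEM_OPEN
            Timestamp.from(Instant.now()),
            mapper.writeValueAsString(eventData),  // the event payload as a JSON blob
            mapper.writeValueAsString(metadata));  // correlation id, user, etc. as a JSON blob
    }
}

The stream_id/version pair is what gives you the correctly ordered histories; everything event-specific stays inside the blob, so adding the 41st event type costs nothing at the schema level.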
Instead I was thinking of serializing (JSON) a snapshot of the domain object(s) to be saved along with the event data.
That's an odd phrase. JSON representation of the event data? That makes sense.
"Snapshot" is an eyebrow raiser, though -- you don't need a snapshot of the event data, because the event is immutable. You definitely don't want to be mixing state snapshots (ie, the result of rolling up the history of events) with the events themselves.
Followup: looking at how a GetEventStore .NET client writes and reads event histories might give you additional ideas for how to design your schema. Notice that the event data/metadata are being handled as blobs.
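The read side for the reports is just the reverse; again only a sketch against the same hypothetical domain_event table:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.jdbc.core.JdbcTemplate;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

public class EventReader {
    private final JdbcTemplate jdbc;
    private final ObjectMapper mapper = new ObjectMapper();

    public EventReader(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Read one stream's history in order; callers deserialize the blob into their event classes.
    public List<JsonNode> readStream(String streamId) {
        return jdbc.query(
            "SELECT data FROM domain_event WHERE stream_id = ? ORDER BY version",
            (rs, rowNum) -> {
                try {
                    return mapper.readTree(rs.getString("data"));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            },
            streamId);
    }
}

From there the report layer can map each blob onto its event class with Jackson, which is exactly the deserialization path the question describes.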

Related

Best way to snapshot MySQL data for audit trail

I'm looking for advice on serializing information for an audit trail in a MySQL database.
I store events that have multiple relations, six to be exact, in an Events table, so each record has six foreign keys. I'm wondering what the most scalable approach is for serializing the related entities and storing them in the same Event record. This is because the data should persist even if the underlying records have been deleted or changed.
The API is TypeScript and we interface with the DB through TypeORM. My initial approach was going to be adding a @BeforeInsert hook that loads all the related entities as JSON (some may or may not be present) and stores all of them in that format, either using a "json" column, text, or some sort of blob conversion.
This will work for the foreseeable future, but I'm wondering what the most scalable approach would be. Thank you!
I don't know what the best solution is; it will probably depend on your features and entities.
Here are some other options you can use:
Concerning deleted records: use softRemove / softDelete. This never actually deletes the record in the DB, but marks it as deleted for TypeORM using the @DeleteDateColumn decorator.
Check the following two TypeORM features: softRemove / softDelete and recover / restore.
Concerning updated records: a workaround for the JSON stringify approach is to never update an entity in place, but always make a copy and then update that copy (not clearly the best solution). TypeORM also has the @VersionColumn decorator to persist versioning of the entity. You can change your code to read the older version of that entity.
Depending on your entities, the table can grow very fast!
With these two options your data is always persisted.
The advantage is that this is resilient to model migrations, since all your data is stored as data and not as JSON-stringified text in a column.
The disadvantage is obviously the size of the database, the maintainability, and the development effort.
Otherwise you can use a subscriber that handles all modifications on the Event table. In this subscriber you can insert your modifications into a table named EventAudit or EventLogWhatever, recording something like "Name was updated from X to Y", or do something more complex with another database storing all versions. In fact, you can do whatever you like inside this global subscriber, since you can access the data source, the entity, and the manager.
I hope this helps.

What are the best practices for pairing two users in the Firebase Realtime Database?

I have been thinking about this question for a long time.
If we want to randomly pair two users without considering any conditions, what should the database structure and code look like?
Also, if we have many conditions to query users by, is the Realtime Database unsuitable, and should I use MySQL or something else?
I don't have experience in this area and would like to know how most people do it.
Thank you.
You should have a "pairs" node, which lists the pair of each user.
When a user wants to find a pair:
Add a key-value node to "pairs", where the key is the UID, and the value is an empty string.
Add a listener to your new node.
Search in "pairs" for another user that has an empty string as a value.
If found, change the values of both nodes to the other user's UID.
When the listener callback fires, it means some other user just paired with you, so you can use the value to learn the other user's UID. Also, don't forget to remove the listener.
The reads and writes to the database should be atomic, in order to prevent bugs in the pairing process (like overwriting an existing pair). Therefore, you should use Firebase transactions.
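For example, the claim-a-waiting-user step can run as a single transaction over the "pairs" node. This is only a rough sketch, written against the Firebase Admin SDK for Java so it could live on a server or Cloud Function (see the note below); the class and method names other than the SDK's are made up, and a real app would want a narrower transaction scope than the whole node:

import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.MutableData;
import com.google.firebase.database.Transaction;

public class PairingService {
    private final DatabaseReference pairsRef =
            FirebaseDatabase.getInstance().getReference("pairs");

    // Atomically either claim a waiting user or register ourselves as waiting.
    public void requestPair(final String myUid) {
        pairsRef.runTransaction(new Transaction.Handler() {
            @Override
            public Transaction.Result doTransaction(MutableData pairs) {
                // The handler may be retried with fresher data, so it must be side-effect free.
                for (MutableData entry : pairs.getChildren()) {
                    boolean waiting = "".equals(entry.getValue());
                    if (waiting && !myUid.equals(entry.getKey())) {
                        // Pair both users in the same atomic write.
                        pairs.child(entry.getKey()).setValue(myUid);
                        pairs.child(myUid).setValue(entry.getKey());
                        return Transaction.success(pairs);
                    }
                }
                // Nobody is waiting: register ourselves with an empty value and let
                // the listener on pairs/<myUid> tell us when someone pairs with us.
                pairs.child(myUid).setValue("");
                return Transaction.success(pairs);
            }

            @Override
            public void onComplete(DatabaseError error, boolean committed, DataSnapshot snapshot) {
                // committed == true means the write went through; the value listener
                // on our own node reveals who we were paired with.
            }
        });
    }
}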
If there are certain conditions for a pair, you can save the conditions' data in the node of the user, inside the "pairs" node (for temporary data), or inside the "users" node that you probably have already (for long term data).
Important
However, this method leaks data about the existing pairs and about the users that are waiting for a pair. I recommend moving the code to your server or to a Cloud Function. Security is really critical here! You should also write some strict security rules for the database.
Hope I managed to help! 😀

Is there a mysql "multitable" trigger - E.G. a trigger that fires only after ALL tables are done updating?

Several tables have a trigger that generates a JSON object representation of the row on update/insert and stores it in a JSON column, e.g. {"email": ..., "relations": N } in an email.json column.
relations is simply a numeric tie-together (let me know if there is a word for it) that allows me to tie together multiple names, emails, phones, and homes into one object,
e.g. the touchRelation.json column:
{
  "emails": [ {"email": "1#a.com"}, {"email": "2#a.com"}, {"email": "N#a.com"} ],
  "teles":  [ {"tele": "..."}, {"tele": "..."}, {"tele": "..."} ],
  "Names":  [ {"Name": "..."}, {"Name": "..."}, {"Name": "..."} ],
  "Homes":  [ {"Home": "..."}, {"Home": "..."}, {"Home": "..."} ]
}
The problem I'm having is that 1) it would be wasteful and inefficient to update touchRelations.json EVERY TIME one of the other tables has data created, updated, or deleted, especially if several tables are updated at one time, and
2) I may not be able to rely on the developer to call update_Relations_json() after each query.
Is there a simple way to tell if one or more of the tables have been updated, and ONLY regenerate relations.json after all updates on all tables have finished?
One possible solution would be to create a "pending updates" table that stores the information in a queue, then one by one inserts/updates the data from the queue table into the storage table and calls the update function, but I'm sure this isn't the best option.
Another option would be to create a JSON parser in the DB that reads the complete JSON relation (the big one from above), updates the tables, then rebuilds the JSON object, but that seems like a poor use of the database.
The BEST option I can think of would be to create an "updates" metadata column with a default of 0. When we update the phone, email, name, or home, the metadata column is changed to 1 (representing an update that has not been committed to the relations JSON column).
Next, create a stored procedure request_relations_json() that checks for pending commits (a 1 in the "updates" meta-column). If there are no updates, return the current relations.json column to the application. If there are updates, regenerate the JSON, then return it to the application.
It's hackish, but it avoids generating JSON on every update. I still hope there's a more elegant solution out there.
Not sure if this is an answer or a comment, but...
MySQL doesn't support transaction triggers. This means that you can only fire triggers when data is changed in a table. In your example, I'm guessing there is no pre-defined order or combination for the data changes - in one case, you might be creating a brand new record, complete with email, address, name and phone; in another case, you might be adding a new phone number to an existing record.
Having a "trigger per table" is the only way in which you can achieve what you want without resorting to exotic solutions (like mimicking a materialized view).
However, as you are already storing your data 3 times (once in the "normal" columns, once in the table JSON, once in the relations JSON), is efficiency really such a big deal? Do you know this is a problem?
A bigger concern for me is that I dislike triggers on principle - they're hard to test, harder to debug, and harder still to maintain in an evolving database.
Triggers that act on data or tables outside the current row make me very nervous - testing the different permutations of insert/update/delete on your 4 tables would be extremely hard - what if the trigger on "touchEmail" has a bug that overwrites the data managed by the trigger on "touchHome"? You may also face deadlocks etc. (not sure if that's a realistic concern on MySQL).
Have you considered using a different cache for your JSON? There are a number of options. MySQL has a query cache; if you can rely on this, you would create the JSON on the fly and cache the queries. This has the huge benefit of dealing with cache invalidation automagically: as the underlying data changes, MySQL purges the relevant items from the cache. On the downside, tuning this cache is tricky.
The next option is whatever your programming language/framework gives you. Most modern frameworks include a solution for caching, but you almost certainly end up invalidating the cache in code; this can be a complex solution, but puts the responsibility where it belongs (the application developers).
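To make that second option concrete, here is a rough sketch of framework-managed caching with explicit invalidation. It happens to use Spring's cache abstraction purely as an example; the service and method names are made up, and the same idea works in any stack that offers a cache with evict-by-key:

import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class RelationJsonService {

    // Build the aggregated JSON on demand; the result is cached per relation id.
    @Cacheable(cacheNames = "relationJson", key = "#relationId")
    public String getRelationJson(long relationId) {
        // Placeholder: SELECT the emails, teles, names and homes for this relation
        // and assemble the JSON document in application code.
        return buildJsonFromTables(relationId);
    }

    // Call this from every code path that modifies one of the underlying tables.
    @CacheEvict(cacheNames = "relationJson", key = "#relationId")
    public void invalidateRelationJson(long relationId) {
        // No body needed: the annotation evicts the cached entry.
    }

    private String buildJsonFromTables(long relationId) {
        return "{}"; // stand-in for the real queries and JSON assembly
    }
}

The point is that the JSON is only rebuilt on the first read after something changed, and the invalidation responsibility sits in application code, as described above.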
If your solution has to scale to exotic levels, you can use a dedicated cache - memcache is available for most environments and languages. It works, scales, is robust - but also introduces significant additional complexity.

Data theory, json and client applications

So, I've been coding web apps for some time now... typically I've done both the data structures/retrieval and the client-side coding. I now have a data admin teammate working with me, and his sole job is to return data from a database through an API that serves JSON; standard stuff.
Recently, I have been having a disagreement with him on how this data should be returned. Essentially, we have two JSON objects: the first is loaded remotely once when the application starts (and includes racer name, racer number, etc.); the second, during the race, is a recurring timed data call that delivers positions incrementally and contains each racer's lat/lon, speed, etc.
Where we differ is that he states it is "inefficient" to return the racer name (from the first call) in the telemetry string (the second call). This forces me to keep the first data object in a global object, and then essentially get the racer's lat/lon and speed from the second data object "on the fly" using a join lookup function, which returns a new JSON object that I populate into a racer grid using jqGrid (it looks something like this: getRaceDataByID(json[0].id){ //lookup race data by racer id in json[1] where json[1].id == json[0].id, take [lat/lon, spd], and return a new json obj row to populate jqGrid }).
The result seems to be an overly coded, slow client (jQuery) application.
My question is about theory. Of course I understand traditional data structures, normalization, SQL, etc. But in today's world of "webapps", it seems that larger web services are moving away from "traditional SQL" data structures and simply returning the data as the client needs it. In this case, it would mean adding about 3 fields (name, bib number, vehicle type, etc.) to the SQL call behind each position telemetry call so I can display the data on the client per the interface's requirement (a data table that displays real-time speed, lat/lon, etc.).
So finally, my question: has anyone had to deal with a situation like this, and am I "all wet" in thinking that 3 extra fields per row, in today's world of massive data-dependent web applications, is not a huge issue to be squabbling over?
Please note: I understand that traditionally you would not want to send more data than you need, and that his understanding of data structures and efficient data transfer (not sending more than you need) is actually correct.
But many times when I'm coding web apps, it's looked at a bit differently because of the stateless nature of the browser, and IMHO it's much easier to just send the data that is needed. My question is not driven by not wanting to code the solution, but rather by trying to put less load on the client by not having to re-stitch the JSON objects into something that I needed in the first place.
I think it makes sense to send these 3 fields with the rest of the data, even if this warrants some sort of duplication. You get the following advantages:
You don't have to maintain the names of racers from the first call in your browser
Your coding logic is simplified (you don't have to match up racer names to subsequent calls; the packet already contains the info)
As far as speed goes, you are doing the majority of the work in your remote call; adding another 3 fields doesn't matter IMHO. It makes your app cleaner.
So I guess I agree with you.

MySQL key/value store problem

I'm trying to implement a key/value store with MySQL.
I have a user table that has 2 columns, one for the global ID and one for the serialized data.
Now the problem is that every time any bit of the user's data changes, I have to retrieve the serialized data from the DB, alter it, re-serialize it, and write it back to the DB. I have to repeat these steps even for a very small change to any of the user's data (since there's no way to update that cell within the DB itself).
Basically, I'm looking for the solutions people normally use when faced with this problem.
Maybe you should preprocess your JSON data and insert it as proper MySQL rows, separated into fields.
Since your input is JSON, you have various alternatives for converting data:
You mentioned many small changes happen in your case. Where do they occur? Do they happen in a member of a list? A top-level attribute?
If updates occur mainly in list members in a part of your JSON data, then perhaps every member should in fact be represented in a different table as separate rows.
If updates occur in an attribute, then represent it as a field.
I think the cost of preprocessing won't hurt in your case.
When this is a problem, people do not use key/value stores; they design a normalized relational database schema that stores the data in separate, single-valued columns which can be updated individually.
To be honest, your solution is using a database as a glorified file system - I would not recommend this approach for application data that is core to your application.
The best way to use a relational database, in my opinion, is to store relational data - tables, columns, primary and foreign keys, data types. There are situations where this doesn't work - for instance, if your data is really a document, or when the data structures aren't known in advance. For those situations, you can either extend the relational model, or migrate to a document or object database.
In your case, I'd first see whether the serialized data could be modeled as relational data, and whether you even need a database. If so, move to a relational model. If you need a database but can't model the data as a relational set, you could go for a key/value model where you extract your serialized data into individual key/value pairs; this at least means you can update or add an individual data field rather than modify the entire document. Key/value is not a natural fit for an RDBMS, but it may be a smaller jump from your current architecture.
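As a rough sketch of that key/value extraction, with invented table and column names (the data access here happens to use Spring's JdbcTemplate, but any DB layer would do):

import org.springframework.jdbc.core.JdbcTemplate;

public class UserAttributeStore {
    private final JdbcTemplate jdbc;

    public UserAttributeStore(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Update (or insert) a single attribute without rewriting the user's whole document.
    public void putAttribute(long userId, String key, String value) {
        int updated = jdbc.update(
            "UPDATE user_attribute SET attr_value = ? WHERE user_id = ? AND attr_key = ?",
            value, userId, key);
        if (updated == 0) {
            jdbc.update(
                "INSERT INTO user_attribute (user_id, attr_key, attr_value) VALUES (?, ?, ?)",
                userId, key, value);
        }
    }

    public String getAttribute(long userId, String key) {
        return jdbc.queryForObject(
            "SELECT attr_value FROM user_attribute WHERE user_id = ? AND attr_key = ?",
            String.class, userId, key);
    }
}

A single small change then becomes one UPDATE of one row, instead of a read-modify-write of the whole serialized blob.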
When you have a key/value store, assuming your serialized data is JSON, it is really only effective when you have memcached alongside it: you don't update the database on the fly every time, but instead update memcached and then push that to your database in the background. So yes, you still have to update the entire value rather than an individual field of your JSON data (like the address alone) in the database, but you can update and retrieve data quickly from memcached. Since there are no complex relations in the database, it will be fast to push and pull data between the database and memcached.
I would continue with what you are doing and create separate tables for the indexable data. This lets you treat your database as a single data store that is easily managed through most operational tasks, including updates, backups, restores, clustering, etc.
The only thing you may want to consider is adding Elasticsearch to the mix if you need to perform anything like a LIKE query, just for improved search performance.
If space is not an issue for you, I would even make it an insert-only database, so any change adds a new record and you keep the history. Of course you may want to remove older records, but a background job can delete the superseded records in batches. (Mind you, what I just described is basically Kafka.)
There are many alternatives out there now that beat an RDBMS in terms of performance. However, they all add extra operational overhead, as each is yet another piece of middleware to maintain.
The way around that, if you have a microservices architecture, is to keep the middleware as part of your microservice stack. However, you then have to deal with transmitting the data across the microservices, so you'd probably still end up switching to Kafka underneath it all.