Integrating systems and snapshotting data exchange - integration

I wanted to know, if in general, when integrating 2 or more systems via whatever means (ie. webservice, MQ, etc.), is it a best practice or a standard for your system to capture a snapshot of data that you are sending with another system? I am thinking that this is as an insurance when reconciling is required for scenarios such as prod incidents.
Secondly, I would think this data snapshot is different from audit trail, in that the data being sent itself is saved (ie. xml data, csv file) as a LOB column in a snapshot table. Is this redundant with the audit trail?

For your first question ...
I've done many, many integrations using queues, web services, etc. and I will usually store an audit trail (a high level set of data telling me what happened), but I've never actually stored the payload itself for each call.
A few reasons for that:
The storage of the payloads being sent back and forth can get quite large.
I can usually reconstruct the payload using the audit trail. "Oh entity XYZ with ID 123 was sent yesterday. Let's take a look at what that entity looks like."
If you do the integration really well and have good testing around it, having copies of the payloads becomes unnecessary.
Instead of storing a copy of the payload I would focus on these things for integration:
Good unit tests on both sides and integration testing for the entire process.
Audit logs as you mentioned.
Good retry policies when a message fails (specifically for queues and topics).
Focusing on idempotent messages. So if something fails, you just do it again and everything is ok.

Related

Push Data onto Queue vs Pull Data by Workers

I am building a web site backend that involves a client submitting a request to perform some expensive (in time) operation. The expensive operation also involves gathering some set of information for it to complete.
The work that the client submits can be fully described by a uuid. I am hoping to use a service oriented architecture (SOA) (i.e. multiple micro-services).
The client communicates with the backend using RESTful communication over HTTP. I plan to use a queue that the workers performing the expensive operation can poll for work. The queue has persistence and offers decent reliability semantics.
One consideration is whether I gather all of the data needed for the expensive operation upstream and then enqueue all of that data or whether I just enqueue the uuid and let the worker fetch the data.
Here are diagrams of the two architectures under consideration:
Push-based (i.e. gather data upstream):
Pull-based (i.e. worker gathers the data):
Some things that I have thought of:
In the push-based case, I would be likely be blocking while I gathered the needed data so the client's HTTP request would not be responded to until the data is gathered and then enqueued. From a UI standpoint, the request would be pending until the response comes back.
In the pull based scenario, only the worker needs to know what data is required for the work. That means I can have multiple types of clients talking to various backends. If the data needs change I update just the workers and not each of the upstream services.
Any thing else that I am missing here?
Another benefit of the pull based approach is that you don't have to worry about the data getting stale in the queue.
I think you already pretty much explained that the second (pull-based) approach is better.
If a user's request should anyway be processed asynchronously, why wait for the data to be gathered and then return a response. You need just to queue a work item and return HTTP response.
Passing data via queue is not a good option. If you get the data upstream, you will have to pass it somehow other than via queue to the worker (usually BLOB storage). That is additional work that is not really needed in your case.
I would recommend Cadence Workflow instead of queues as it supports long running operations and state management out of the box.
Cadence offers a lot of other advantages over using queues for task processing.
Built it exponential retries with unlimited expiration interval
Failure handling. For example it allows to execute a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long running heartbeating operations
Ability to implement complex task dependencies. For example to implement chaining of calls or compensation logic in case of unrecoverble failures (SAGA)
Gives complete visibility into current state of the update. For example when using queues all you know if there are some messages in a queue and you need additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over Cadence programming model.

What's the most efficient architecture for this system? (push or pull)

All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySql.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
1 seems inefficient. 2 seems better. 3 reduces TCP traffic, but that isn't much. Anyway, just a few bytes for each 2. However, it has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What it the best way to do this?
I guess there are two things to consider. Most importantly, reduce network traffic and, less important, make my code simpler.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled & can't find an answer. I am hoping for the beneft of your experience. Can you help me reword this to be acceptable? or maybe give a UTL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenarios, clients can also use the middleware message broker send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
Similarly, if a member of the third group is viewing a report and a
member of either of the first two groups updates data, I wish to
reflect this in the report. What it the best way to do this?
If this is a real requirement (reports are usually a immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while beeing watched, similar to a stock ticker) but it is easy to implement - a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexible and resource-saving with existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions are all viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient
Is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to strive to document and/or share with team members who could be affected by your changes their team when a tech debt-inducing fix has been checked-in.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (client to provide time to avoid service having to store and retrieve seesion state)
If nothing has changed, server can provide a time when the client should poll again. A system under excessive load is then able to back-off clients, i.e if a server application has an awareness of such conditions, then it is therefore better able to control the polling rate of compliant clients, by instructing them to wait for a longer period before retrying.
After considering that, ask "Is the API as simple as it can possibly be?"
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about- publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push - timeout scenarios, number of clients, etc, System resource consumption, etc.
Consider that the "pusher" has to become aware of all consuming applications. If using industry standard messaging queueing systems (RabbitMQ, MS MQ, MQ Series, etc, all naturally supporting Publish/Subscribe JMS topics or equivalent then this problem is abstracted away, but also added some complexity to your application)
consider the scenarios where clients suddenly become unavailable, hypothesize failure modes and test the robustness of you system so you have confidence that it is able to recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?

How to solve batch processing in ESB?

We have a legacy system that produces files that each contains hundreds of messages (financial transactions). We need to transform these messages into another format and submit them (individually) to a target system. The question is:
Should ESB accept these files for processing directly, or should there be an adapter application between the legacy system and ESB that would split received files into individual messages and let the ESB to process the messages individually (instead of processing the whole file)?
In the first solution we expect two ESB flows. The first one would transform the file into a new format, split it into the messages, and store these messages into a temporary location. The transformation needs to process the file as a whole, because the file contains some common sections that are needed for transformation of the individual messages.
The second flow would take the individual transformed messages (each in a separate DB transaction), pass them to the target system, and wait for its answer (synchronously or asynchronously).
The second solution would replace the first flow by an external application that would transform the file, split it into individual transformed messages, and store them in a temporary location (local file system). The second flow would stay in the ESB.
In our eyes, the disadvantage of the first solution is in that the ESB would have to process huge files (in the first flow), which is commonly considered an antipattern. On the other hand, the ESB would adjust directly to the interface of the legacy system, which is one of the purposes of ESB.
In the second solution, the adapter application would contain the transformation logic, which should be another of the purposes and responsibilities of ESB.
What is the commonly suggested solution for this situation (a pattern)? Could you provide some references that are more descriptive than these two links that I've found?
http://publib.boulder.ibm.com/infocenter/esbsoa/wesbv7r5/index.jsp?topic=%2Fcom.ibm.websphere.wesb.programming.doc%2Ftopics%2Fesbprog_patterns.html
https://www.ibm.com/developerworks/wikis/display/esbpatterns/File+Processing
Edit
Another reference:
http://www.ibm.com/developerworks/webservices/library/ws-largemessaging/
Remember that there are 3 message types in SOA: Command, Event, Document
That 'Document' bit is for chunks of data. It is probably better suited to 'real' document types such as 'Order' or 'Invoice' and the like but there is nothing stopping you from going with 'TransactionBatch'.
That being said, it is a rather unused message type in that not many service buses actually implement anything around it, since:
you do not really need it
many message queuing technologies have limits on message size (as low as 4kb) making it difficult to transport any large message (needs to be sent in chunks)
So what I would do in your scenario is have an endpoint that processes the file. So something like a ProcessTransactionFileCommand sent to the processing endpoint and in it you only have a reference to the actual file (stored somewhere in the file system or even a url to download from). That processing endpoint can process the file and send the individual messages (all within a transaction) to the integration endpoint that sends the message off to the external system. You could have a SendTransactionCommand to do that.
In this way your system is very flexible in that the integration endpoint can receive individual integration commands form some parts of your solution while the processing endpoint can handle the batch and split them into individual integration commands.
Should you be in the .NET space you may want to look at my FOSS service bus project: http://shuttle.codeplex.com/
But any service bus will do the trick (MassTransit, NServiceBus, etc.)
You can use an ESB for the first case and I don't think it would be an anti-pattern. The purpose of an ESB it's also to integrate legacy applications, that create files as output as in your use case, with other applications.
You can try Mule ESB. It will allow you to consume the file using streaming (through the file transport), map the content of the file to your desired output using a GUI called DataMapper and finally put those messages en a VM queue which can be a persistent queue within the ESB. This queues are transactional so you can guarantee that all the messages created from one file were put on the VM queue or none of them.
Then you can, from another flow (in fact processed within the ESB are called flows in mule) read each of those messages and process them.
HTH, Pablo.

Core Data and JSON question

I know this question has been possed before, but the explanation was a little unclear to me, my question is a little more general. I'm trying to conceptualize how one would periodically update data in an iPhone app, using a remote web service. In theory a portion of the data on the phone would be synced periodically (only when updated). While other data would require the user be online, and be requested on the fly.
Conceptually, this seems possible using XML-RPC or JSON and Core data. I wonder if anyone has an opinion on the best way to implement this, I am a novice iPhone developer, but I understand much of the process conceptually.
Thanks
To synchronize a set of entities when you don't have control over the server, here is one approach:
Add a touched BOOL attribute to your entity description.
In a sync attempt, mark all entity instances as untouched (touched = [NSNumber numberWithBool:NO]).
Loop through your server-side (JSON) instances and add or update entities from your Core Data store to your server-side store, or vice versa. The direction of updating will depend on your synchronization policy, and what data is "fresher" on either side. Either way, mark added, updated or sync'ed Core Data entities as touched (touched = [NSNumber numberWithBool:YES])
Depending on your sync policy, delete all entity instances from your Core Data store which are still untouched. Untouched entities were presumably deleted from your server-side store, because no addition, update or sync event took place between the Core Data store and the server for those objects.
Synchronization is a fair amount of work to implement and will depend on what degree of synchronization you need to support. If you're just pulling data, step 3 is considerably simpler because you won't need to push object updates to the server.
Syncing is hard, very hard. Ideally you would want to receive deltas of the changes from the server and then using a unique id for each record in Core Data, update only those records that are new or changed.
Assuming you can do that, then the code is pretty straight forward. If you are syncing in both directions then things get more complicated because you need to track deltas on both sides and handle collisions.
Can you clarify what type of syncing you are wanting to accomplish? Is it bi-directional or pull only?
I have an answer, but it's sucky. I'm currently looking for a more acceptable/reliable solution (i.e. anything Marcus Zarra cooks up).
What I've done needs some work ... seriously, because it doesn't work all the time...
The mobile device has a json catalog of entities, their versions, and a url pointing to a json file with the entity contents.
The server has the same setup, the catalog listing the entities, etc.
Whenever the mobile device starts, it compares the entity versions of it's local catalog with the catalog on the server. If any of those versions on the server are newer, it offers the user an opportunity to download the entity updates.
When the user elects to update, the mobile device now has the url for each of the new/changed entities and downloads it. Once downloaded, the app will blow away all objects for each of the changed entities, and then insert the new objects from JSON. In the event of an error, the deletions/insertions are rolled back to pre-update status.
This works, sort of. I can't catch it in a debug session when it goes awry, so I'm not sure what might cause corruption or inconsistency in the process.

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address this kind of issues. Service Broker (I'll refer further as SSB). The problem is that it has a very steep learning curve.
I know MySpace went public how uses SSB to manage their park of SQL Servers: MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data. I know of several more (major) sites that use similar patterns but unfortunately they have not gone public so I cannot refer names. I was personally involved with some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicate data transfer technology like Replication. As such you will not find anyhting similar to the publishing wizards and simple deployment options of Replication (check a table and it gets transferred). SSB is a reliable messaging technology and as such its primitives stop at the level of message exchange, you would have to write the code that leverages the data change capture, packs it as messages and also the unpacking of message into relational tables at destination.
Why still some companies preffer SSB over Replication at a task like you describe is because SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines, rebuild servers all without changing the application. Because data flow occurs over logical routes the application can addapt on-the-fly to new topologies. SSB is also resilient to long periods of disocnnect and downtime, being capable of resuming the data flow after hours, days and even months of disconnect. High troughput achieved by engine integration (SSB is part of the SQL engine itself, is not a collection of sattelite applications and processes like Replication) means that the backlog of changes can be processes on reasonable times (I know of sites that are going through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incomming data. SSB also has some unique features like built-in load balancing (via routes) with sticky session semantics, support for deadlock free application specific correlated processing, priority data delivery, specific support for database mirroring, certificate based authentication for cross domain operations, built-in persisted timers and many more.
This is not a specific answer 'how to move data from table T on server A to server B'. Is more a generic technology on how to 'exhange data between server A and server B'.
I've never had to deal with this scenario before but did come up with a possible solution for this. Basically, it would require a change in your main database structure. Instead of storing the data, you would keep records of modifications of this data. Thus, if a record is added, you store "Table X, inserted new record with these values: ..." With modifications, just store the table, field and changed value. With deletions, just store which record is deleted. Every modification will be stored with a timestamp.
Your client systems would keep their local copies of the database and will regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it will be up-to-date again.
And the back-end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data every time.
But again, I never had the chance to work this out in a real application so it's still a theoretic principle for me. It seems fast but a lot of work will be required.
Option 1: Write an app to transfer the data using row level transactions. It might take longer but would result in no interruption of the site using the data because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In sql server 2008 you can set READ_COMMITTED_SNAPSHOT to ON to ensure that the row being updated is not causing blocking.
But basically all this app does is read the new data as it is available out from one database and into the other.
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you were talking about a single table (or this could work with many) what you can do is a view swap. So you write your code against a sql view which points to table A. You do you work on Table B and when it's ready, you update the view to point to Table B. You can even write a function that determines the active table and automate the whole swap thing.
Option 4: You might be able to use something like byte-level replication of the server. That sounds scary though. Which is basically copying the server from point A to point B exactly down to the very bytes. It's mostly used in DR situations which this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)