How to solve batch processing in ESB?

We have a legacy system that produces files that each contains hundreds of messages (financial transactions). We need to transform these messages into another format and submit them (individually) to a target system. The question is:
Should the ESB accept these files for processing directly, or should there be an adapter application between the legacy system and the ESB that would split received files into individual messages and let the ESB process the messages individually (instead of processing the whole file)?
In the first solution we expect two ESB flows. The first one would transform the file into a new format, split it into the messages, and store these messages into a temporary location. The transformation needs to process the file as a whole, because the file contains some common sections that are needed for transformation of the individual messages.
The second flow would take the individual transformed messages (each in a separate DB transaction), pass them to the target system, and wait for its answer (synchronously or asynchronously).
The second solution would replace the first flow by an external application that would transform the file, split it into individual transformed messages, and store them in a temporary location (local file system). The second flow would stay in the ESB.
In our eyes, the disadvantage of the first solution is that the ESB would have to process huge files (in the first flow), which is commonly considered an antipattern. On the other hand, the ESB would adjust directly to the interface of the legacy system, which is one of the purposes of an ESB.
In the second solution, the adapter application would contain the transformation logic, which should be another of the purposes and responsibilities of ESB.
What is the commonly suggested solution for this situation (a pattern)? Could you provide some references that are more descriptive than these two links that I've found?
http://publib.boulder.ibm.com/infocenter/esbsoa/wesbv7r5/index.jsp?topic=%2Fcom.ibm.websphere.wesb.programming.doc%2Ftopics%2Fesbprog_patterns.html
https://www.ibm.com/developerworks/wikis/display/esbpatterns/File+Processing
Edit
Another reference:
http://www.ibm.com/developerworks/webservices/library/ws-largemessaging/

Remember that there are 3 message types in SOA: Command, Event, Document
That 'Document' bit is for chunks of data. It is probably better suited to 'real' document types such as 'Order' or 'Invoice' and the like but there is nothing stopping you from going with 'TransactionBatch'.
That being said, it is a rather unused message type in that not many service buses actually implement anything around it, since:
you do not really need it
many message queuing technologies have limits on message size (as low as 4kb) making it difficult to transport any large message (needs to be sent in chunks)
So what I would do in your scenario is have an endpoint that processes the file. So something like a ProcessTransactionFileCommand sent to the processing endpoint and in it you only have a reference to the actual file (stored somewhere in the file system or even a url to download from). That processing endpoint can process the file and send the individual messages (all within a transaction) to the integration endpoint that sends the message off to the external system. You could have a SendTransactionCommand to do that.
In this way your system is very flexible, in that the integration endpoint can receive individual integration commands from some parts of your solution while the processing endpoint can handle the batch and split it into individual integration commands.
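The split described above can be sketched roughly as follows. This is only an illustration in Python, not tied to any particular service bus: the file layout (first line is a common header, every other line is one transaction), the field names, and the `send` callback are all assumptions made up for the example. A real bus would also wrap the sends in a transaction, which this sketch does not do.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass(frozen=True)
class ProcessTransactionFileCommand:
    # Carries only a reference to the batch file, not its contents.
    file_path: str


@dataclass(frozen=True)
class SendTransactionCommand:
    # One transaction plus the shared batch header it was split from.
    batch_header: str
    payload: str


def handle_process_file(
    command: ProcessTransactionFileCommand,
    send: Callable[[SendTransactionCommand], None],
) -> int:
    """Split the referenced file and pass each transaction to `send`.

    Assumed (hypothetical) layout: line 1 is the common header, every
    following non-empty line is one transaction.
    """
    lines = Path(command.file_path).read_text(encoding="utf-8").splitlines()
    if not lines:
        return 0
    header = lines[0]
    records = [line for line in lines[1:] if line.strip()]

    # Build every command first, so a malformed file fails before anything is sent.
    outgoing = [SendTransactionCommand(header, record) for record in records]
    for message in outgoing:
        send(message)
    return len(outgoing)


def run_demo() -> List[SendTransactionCommand]:
    """Write a tiny batch file and collect the commands it is split into."""
    sent: List[SendTransactionCommand] = []
    with tempfile.TemporaryDirectory() as folder:
        batch = Path(folder) / "batch.txt"
        batch.write_text("BATCH-2024-01\nTX1;100.00\nTX2;250.50\n", encoding="utf-8")
        handle_process_file(ProcessTransactionFileCommand(str(batch)), sent.append)
    return sent
```

Here `send` stands in for whatever the bus uses to dispatch a command to the integration endpoint, so the same endpoint can also receive `SendTransactionCommand` messages from other parts of the solution.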
Should you be in the .NET space you may want to look at my FOSS service bus project: http://shuttle.codeplex.com/
But any service bus will do the trick (MassTransit, NServiceBus, etc.)

You can use an ESB for the first case and I don't think it would be an anti-pattern. The purpose of an ESB is also to integrate legacy applications that create files as output, as in your use case, with other applications.
You can try Mule ESB. It will allow you to consume the file using streaming (through the file transport), map the content of the file to your desired output using a GUI called DataMapper, and finally put those messages in a VM queue, which can be a persistent queue within the ESB. These queues are transactional, so you can guarantee that either all the messages created from one file were put on the VM queue or none of them were.
Then, from another flow (processes within the ESB are called flows in Mule), you can read each of those messages and process them.
HTH, Pablo.

Related

Right tech choice to work with JSON data?

We are a data team working with our data producers to store and process infrastructure log data. Workers running on our client systems generate log data, primarily in JSON format.
There is no defined structure to the JSON data, as it depends on multiple factors like the number of clusters run by the client, tokens generated, etc. There is some definite structure to the top-level JSON elements that contain the metadata about where the logs are generated. The actual data can go into multiple levels of nesting with varying key-value pairs.
I want to build a system to ingest these logs, parse them, and present them in a way that engineers and PMs (product managers) can read for analytics use cases.
My initial plan is to set up a compute layer like Kinesis at the source and write parsing logic that stores the outcome in S3. However, this would need prior knowledge of the JSON file itself.
I would define a parser module to process the data based on the log type. For every incoming log, my compute layer (Kinesis?) directs data processing to the corresponding parser module and emits data into S3.
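To make the per-log-type dispatch concrete, here is a minimal sketch of what I have in mind, written in plain Python and independent of Kinesis or S3. The field names (`log_type`, `body`, `clusters`, `tokens`) are hypothetical placeholders, since our real top-level metadata looks different.

```python
import json
from typing import Any, Callable, Dict

Parser = Callable[[Dict[str, Any]], Dict[str, Any]]

# Registry mapping a log type to the parser module that understands it.
PARSERS: Dict[str, Parser] = {}


def parser_for(log_type: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser function for one log type."""
    def register(func: Parser) -> Parser:
        PARSERS[log_type] = func
        return func
    return register


@parser_for("cluster")
def parse_cluster(body: Dict[str, Any]) -> Dict[str, Any]:
    return {"cluster_count": len(body.get("clusters", []))}


@parser_for("token")
def parse_token(body: Dict[str, Any]) -> Dict[str, Any]:
    return {"token_count": len(body.get("tokens", []))}


def process(raw_line: str) -> Dict[str, Any]:
    """Route one JSON log line to its parser based on top-level metadata."""
    record = json.loads(raw_line)
    log_type = record.get("log_type", "unknown")
    parser = PARSERS.get(log_type)
    if parser is None:
        # Unknown types are kept raw instead of being dropped.
        return {"log_type": log_type, "parsed": False, "raw": record}
    return {"log_type": log_type, "parsed": True, **parser(record.get("body", {}))}
```

The weakness is visible in the fallback branch: every new log type needs a new parser before its data becomes usable.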
However, I am starting to explore whether a different storage engine (Elastic, etc.) would fit my use case better. I am wondering if anyone has run into such use cases, and what you found helpful in solving the problem.

Using Couchbase SDK vs Sync Gateway API

I have a full deployment of Couchbase (Server, Sync Gateway and Lite) and have an API, mobile app and web app all using it.
It works very well, but I was wondering if there are any advantages to using the Sync Gateway API over the Couchbase SDK. Specifically, I would like to know if Sync Gateway would handle larger numbers of operations better than the SDK, perhaps through an internal queue/cache system, but I can't seem to find definitive documentation for this.
At the moment the API uses the C# Couchbase SDK and we use Sync Gateway very little (only really for synchronising the mobile app).
First, some relevant background info:
Every document that needs to be synced over to Couchbase Lite (CBL) clients needs to be processed by the Sync Gateway (SGW). This is true whether a doc is written via the SGW API or whether it comes in via a server write (N1QL or SDK). The latter case is referred to as "import processing", wherein the document that is written to the bucket (via N1QL) is read by SGW via the DCP feed. The document is then processed by SGW and written back to the bucket with the relevant sync metadata.
Prerequisite:
In order for the SGW to import documents written directly via N1QL/SDK, you must enable “shared bucket access” and import processing as discussed here
Non-mobile documents:
If you have documents that are never going to be synced to the CBL clients, then the choice is obvious: use the server SDKs or N1QL.
Mobile documents (docs to sync to CBL clients):
Assuming you are on SGW 2.x syncing with CBL 2.x clients.
If you have documents written at the server end that need to be synced to CBL clients, then consider the following:
Server side write rate:
If you are looking at writes on the server side coming in at sustained rates significantly exceeding 1.5K/sec (let's say 5K/sec), then you should go the SGW API route. While it's easy enough to do a bulk update via a server N1QL query, remember that SGW still needs to keep up and do the import processing (as discussed in the background).
This means that if you are doing high-volume updates through the SDK/N1QL, you will have to rate limit them so the SGW can keep up (do batched updates via the SDK).
That said, it is important to consider that if SGW can't keep up with the write throughput on the DCP feed, it's going to result in latency, no matter how the writes are happening (SGW API or N1QL).
If your sustained write rate on the server isn't expected to be significantly high, then go with N1QL.
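As a rough illustration of rate-limited batched updates, here is a small Python sketch. It deliberately does not use the Couchbase SDK: `write_batch` is a hypothetical callable standing in for whatever bulk write you use, and the 1500 docs/sec default is just the figure mentioned above, not a tuned value for your deployment.

```python
import time
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` items from any iterable."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_throttled(
    docs: Iterable[dict],
    write_batch: Callable[[List[dict]], None],
    batch_size: int = 500,
    max_docs_per_sec: float = 1500.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Write docs in batches, pausing so the average rate stays under the cap."""
    # Minimum time one full batch is allowed to take at the target rate.
    min_interval = batch_size / max_docs_per_sec
    written = 0
    for batch in chunked(docs, batch_size):
        started = clock()
        write_batch(batch)
        written += len(batch)
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return written
```

The `sleep` and `clock` parameters exist only so the pacing can be tested without real waiting. In practice you would watch the import backlog on the SGW side and adjust the cap accordingly.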
Deletes Handling:
Does not matter. Under shared-bucket-access, deletes coming in via SDK or SGW API will result in a tombstone. Read more about it here
SGW specific config:
Naturally, if you are dealing with SGW specific config, creating SGW users, roles, then you will use the SGW API for that.
Conflict Handling:
In 2.x, it does not matter. Conflicts are handled on CBL side.
Challenge with SGW API
Probably the biggest challenge in a real-world scenario is that using the SG API path means either storing information about SG revision IDs in the external system, or performing every mutation as a read-then-write (since we don't have a way to PUT a document without providing a revision ID).
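The read-then-write pattern looks roughly like the following Python sketch. It runs against a small in-memory stand-in rather than the real Sync Gateway REST API: `FakeRevisionStore` and its integer revisions are assumptions made for the example (real revision IDs are opaque strings), but the shape of the loop is the same.

```python
from typing import Dict, Optional, Tuple


class RevisionConflict(Exception):
    """Raised when an update does not carry the current revision."""


class FakeRevisionStore:
    """In-memory stand-in for a store that demands the current revision on update."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def get(self, doc_id: str) -> Optional[Tuple[int, dict]]:
        return self._docs.get(doc_id)

    def put(self, doc_id: str, body: dict, rev: Optional[int] = None) -> int:
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise RevisionConflict(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


def upsert(store: FakeRevisionStore, doc_id: str, changes: dict, attempts: int = 3) -> int:
    """Read the current revision, then write; retry if someone else wrote in between."""
    for _ in range(attempts):
        existing = store.get(doc_id)
        rev, body = existing if existing else (None, {})
        try:
            return store.put(doc_id, {**body, **changes}, rev=rev)
        except RevisionConflict:
            continue  # revision moved under us: re-read and try again
    raise RevisionConflict(f"{doc_id}: gave up after {attempts} attempts")
```

Every mutation costs an extra read, which is the overhead described above. The alternative is to keep the last known revision ID in the external system.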
The short answer is that for backend operations, Couchbase SDK is your choice, and will perform much better. Sync Gateway is meant to be used by Mobile clients, with few exceptions (*).
Bulk/Batch operations
In my performance tests using the Java Couchbase SDK and bulk operations from AsyncBucket (link), I have updated up to 8 thousand documents per second. In .NET you can do batch operations too (link).
Sync Gateway also supports bulk operations, yet it is much slower because it relies on a REST API and requires you to provide a _rev from the previous version of each document you want to update. This will usually result in the backend having to do a GET before doing a PUT. Also, keep in mind that Sync Gateway is not a storage unit. It just works as a proxy to Couchbase, managing mobile client access to segments of data based on the channels registered for each user, and it writes all of its metadata documents into the Couchbase Server bucket, including channel indexing, user registration, document revisions and views.
Querying
Views are indexed, so they can respond very fast when querying large data sets. Whenever a document is changed, the map function of every view has the opportunity to map it. But when a view is created through the Sync Gateway REST API, some code is added to your map function to handle user channels/permissions, making it slower than plain code created directly in the Couchbase Admin UI. Querying views with compound keys using the startKey/endKey parameters is very powerful when you have hierarchical data, but this functionality and the use of reduce functions are not available for mobile clients.
N1QL can also be very fast when your query takes advantage of Couchbase indexes.
Notes
(*) One exception to the rule is when you want to delete a document and have this reflected on mobile phones. The DELETE operation leaves an empty document with a _deleted: true attribute, and can only be done through Sync Gateway. The next time the mobile device synchronizes and finds this hint, it will delete the document from local storage. You can also set this attribute through a PUT operation, in which case you may also add an _exp: "2019-12-12T00:00:00.000Z" attribute to schedule a purge of the document at a future date, so that the server is cleaned up as well. However, just purging a document through Sync Gateway is equivalent to deleting it through the Couchbase SDK, and this won't be reflected on mobile devices.
NOTE: Prior to Sync Gateway 1.5 and Couchbase 5.0, all backend operations had to be done directly in Sync Gateway so that Sync Gateway and mobile clients could detect those changes. This has changed since shared_bucket_access option was introduced. More info here.

Connect Adobe Analytics to MySQL

I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your MySQL db with AA, make queries, or whatever.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it, within your local db. But in general, I listed them in order of level of difficulty of setting something up for it and doing something with the file(s) once received, to get them into your database.
The first option is to schedule, within the AA interface, data to be FTP'd to you on a regular basis. This can be a scheduled report from the report interface or from Data Warehouse, and it can be delivered in a variety of formats, most commonly as a CSV file. This will export data to you that has already been processed by AA, meaning aggregated metrics, etc. Overall, it is pretty easy to set up and to parse the exported CSV files. There are a number of caveats/limitations, but how much they matter depends on what specifically you're aiming to do.
The second option is to make use of their API endpoint to make requests and receive responses in JSON format. You can also receive them in XML format, but I recommend not doing that. You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on demand, building interfaces (if relevant to you), etc. It also comes with the same caveats/limitations as the first option, since the data is already processed/aggregated.
Third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe collection servers without Adobe doing anything to it, but it's not 100% like a server request log or something. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value, this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have these other files that are lookup tables for both columns and values within them. It's a bit of a headache piecing it all together, but it's doable. The real issue is file sizes. These are raw hit data files and even a site with moderate traffic will generate files many gigabytes large, daily, and even hourly. So bandwidth, disk space, and your server processing power are things to consider if you attempt to go this route.

Meteor performance comparison of publishing static data vs. getting data via HTTP Get request

I am building an app that receives a bunch of static data that is read only. The user does not change the data, or send any data to the server. The app just gets the data and presents it to the user in various views.
Like for example a parts list, with part numbers and prices. This data is currently stored in mongoDB.
I have a few options for getting the data to the client. I could just use Meteor's publication system, and have the client subscribe to the data it needs.
Or I could map all the data the client needs into one JSON file, save the JSON file to Amazon S3, and have the client make a simple GET request to grab the data.
If we wanted this app to scale to many, many users, would not using Meteor publication be the best? Or would either method be similar in terms of performance? Using Meteor's publication system would be the easiest, but I am worried that going down this route would lead to performance issues if a lot of clients request the data. If the performance of publishing and of a GET request is about the same, I would just stick with the publication, as it's the easiest.
In this case Meteor will provide better performance. If your data is mostly server to client driven then clients do not have to worry about polling the server and the server will not have to worry about handling the request.
Also, Meteor requires very few resources to send data to the client because the connection is persistent. Take an app like Code Fights, which is built on Meteor: it constantly has thousands of connections to and from it, and it performs well.
As a side note, if you are ready to serve your static data as a JSON file in a separate server (AWS S3), then it means you do not expect that data to be that big, so that it can be handled in a single file and entirely loaded in client's memory.
In that case, you might even want to reconsider the need to perform any separate request (whether HTTP or Meteor Pub/Sub).
For instance, you could simply embed the data in your app, or serve it through SSR / the Fast Render package.
Then if you are really concerned about your scalability, you might even reconsider the need to use Meteor, since you do not seem to need any client-server interactivity (no real need for Pub/Sub, no reactivity…). After your prototype is ready, you could rework it as a separate and static SPA, so that you do not even need to serve it through Node / Meteor.

Integrating systems and snapshotting data exchange

I wanted to know whether, in general, when integrating two or more systems via whatever means (i.e. web service, MQ, etc.), it is a best practice or a standard for your system to capture a snapshot of the data that you are exchanging with another system. I am thinking that this serves as insurance when reconciliation is required in scenarios such as prod incidents.
Secondly, I would think this data snapshot is different from an audit trail, in that the data being sent is itself saved (i.e. XML data, a CSV file) as a LOB column in a snapshot table. Is this redundant with the audit trail?
For your first question ...
I've done many, many integrations using queues, web services, etc. and I will usually store an audit trail (a high level set of data telling me what happened), but I've never actually stored the payload itself for each call.
A few reasons for that:
The storage of the payloads being sent back and forth can get quite large.
I can usually reconstruct the payload using the audit trail. "Oh entity XYZ with ID 123 was sent yesterday. Let's take a look at what that entity looks like."
If you do the integration really well and have good testing around it, having copies of the payloads becomes unnecessary.
Instead of storing a copy of the payload I would focus on these things for integration:
Good unit tests on both sides and integration testing for the entire process.
Audit logs as you mentioned.
Good retry policies when a message fails (specifically for queues and topics).
Focusing on idempotent messages. So if something fails, you just do it again and everything is ok.
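To show how the last two points work together, here is a small Python sketch of a retrying sender paired with an idempotent consumer. It is not tied to any queueing product: the message shape (a dict with an `id` key), the in-memory set of processed IDs, and the use of `ConnectionError` as the transient failure are all assumptions for the example. A real consumer would persist the processed IDs.

```python
from typing import Callable, Dict, Optional, Set


class IdempotentConsumer:
    """Applies each message at most once, keyed by its message ID."""

    def __init__(self, apply: Callable[[Dict[str, str]], None]) -> None:
        self._apply = apply
        self._processed: Set[str] = set()

    def handle(self, message: Dict[str, str]) -> bool:
        message_id = message["id"]
        if message_id in self._processed:
            return False  # duplicate delivery: safely ignored
        self._apply(message)
        self._processed.add(message_id)
        return True


def send_with_retry(
    deliver: Callable[[Dict[str, str]], None],
    message: Dict[str, str],
    max_attempts: int = 3,
) -> int:
    """Retry delivery on transient failures; returns the number of attempts used."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            deliver(message)
            return attempt
        except ConnectionError as error:
            last_error = error
    raise RuntimeError(f"delivery failed after {max_attempts} attempts") from last_error
```

If the acknowledgement is lost after the consumer has already applied a message, the sender retries and the consumer sees the same ID twice, but the change is applied only once. That is what makes "just do it again" safe.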