Amazon SWF ActivityWorker - workerForCommonTaskList vs workerForHostSpecificTaskList - aws-sdk

In Amazon SWF, what is the difference between the following kinds of ActivityWorker: workerForCommonTaskList vs. workerForHostSpecificTaskList? Does it mean that workerForHostSpecificTaskList picks up tasks on a list meant to be executed on the host where the ActivityWorker is running? If so, how does one add tasks to such a list?

A task list is essentially a queue. SWF supports an unlimited number of task lists, and they are created on demand without explicit registration. It is up to the application to decide how many workers consume from a task list. The common design pattern is to have a common task list that a worker on every host listens to, plus a host-specific task list per host (or per Mesos/Kubernetes task, or even per process instance). One of the activity tasks dispatched to the common task list returns the host-specific task list name; activities can then be scheduled onto that host-specific task list to route their execution to that particular host.
How an activity is scheduled onto a specific task list is client-side-library specific. In the Java AWS Flow Framework, it is done by passing the task list name in an ActivitySchedulingOptions structure, which can be supplied as an additional parameter to an activity invocation.
See the fileprocessing sample, which demonstrates such routing.
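For illustration, a minimal Java sketch of that routing (FileActivitiesClient is a hypothetical stand-in for the activities client the framework generates, and the Promise<> plumbing the framework normally requires is elided; ActivitySchedulingOptions is the Flow Framework structure mentioned above):

// Sketch: route a follow-up activity to the task list returned by the
// first activity. FileActivitiesClient is a hypothetical stand-in for
// the generated activities client; Promise<> handling is elided.
import com.amazonaws.services.simpleworkflow.flow.ActivitySchedulingOptions;

interface FileActivitiesClient {
    // Runs on the common task list; returns its host-specific task list name.
    String download(String remoteFile);
    void processDownloadedFile(String file, ActivitySchedulingOptions options);
}

public class RoutingSketch {
    private final FileActivitiesClient activities;

    public RoutingSketch(FileActivitiesClient activities) {
        this.activities = activities;
    }

    public void processFile(String remoteFile) {
        String hostTaskList = activities.download(remoteFile);

        // Schedule the next activity on that host's task list so it
        // executes on the same machine that downloaded the file.
        ActivitySchedulingOptions options = new ActivitySchedulingOptions();
        options.setTaskList(hostTaskList);
        activities.processDownloadedFile(remoteFile, options);
    }
}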
BTW, have you looked at Cadence, an open-source alternative to SWF that is actively developed and much more feature-rich than SWF?

Related

Enterprise Application Integration Styles in 2023?

As per the Enterprise Integration Patterns book written in 2003, the following four integration styles are mentioned:
1. File Transfer — One application writes a file that another later reads. The applications need to agree on the filename and location, the format of the file, the timing of when it will be written and read, and who will delete the file.
2. Shared Database — Multiple applications share the same database schema, located in a single physical database. Because there is no duplicate data storage, no data has to be transferred from one application to the other.
3. Remote Procedure Invocation — One application exposes some of its functionality so that it can be accessed remotely by other applications as a remote procedure. The communication occurs in real time and synchronously.
4. Messaging — One application publishes a message to a common message channel. Other applications can read the message from the channel at a later time. The applications must agree on a channel as well as the format of the message. The communication is asynchronous (a minimal sketch of this style follows the list).
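For instance, the Messaging style boils down to something like this hedged Java/JMS sketch (the broker URL and channel name are illustrative assumptions):

// Minimal sketch of the Messaging style using the JMS API with ActiveMQ.
// Broker URL, channel name, and message payload are illustrative.
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class OrderPublisher {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            // Both applications must agree on this channel and message format.
            Queue channel = session.createQueue("orders");
            MessageProducer producer = session.createProducer(channel);
            producer.send(session.createTextMessage("{\"orderId\": 42}"));
            // A consumer may read this message at any later time: asynchronous.
        } finally {
            connection.close();
        }
    }
}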
As of 2023, is it still the same, or have more styles been added since the book was written? What related topics can I search for?
Also, can I consider REST and SOAP services to come under Remote Procedure Invocation?

designing an agnostic configuration service

Just for fun, I'm designing a few web applications using a microservices architecture. I'm trying to determine the best way to do configuration management, and I'm worried that my approach for configuration may have some enormous pitfalls and/or something better exists.
To frame the problem, let's say I have an authentication service written in C++, an identity service written in Rust, an analytics service written in Haskell, some middle tier written in Scala, and a frontend written in JavaScript. There would also be the corresponding identity DB, auth DB, analytics DB (maybe a Redis cache for sessions), etc. I'm deploying all of these apps using Docker Swarm.
Whenever one of these apps is deployed, it necessarily has to discover all the other applications. Since I use Docker Swarm, discovery isn't an issue as long as all the nodes share the requisite overlay network.
However, each application still needs the upstream service's host_addr, maybe a port, the credentials for some DB or sealed service, etc.
I know Docker has secrets, which enable apps to read configuration from the container, but I would then need to write a configuration parser in each language for each service. This seems messy.
What I would rather do is have a configuration service, which maintains knowledge about how to configure all other services. So, each application would start with some RPC call designed to get the configuration for the application at runtime. Something like:
int main() {
    AppConfig cfg = configClient.getConfiguration("APP_NAME");
    // do application things... and pass around cfg
    return 0;
}
The AppConfig would be defined in an IDL, so the class would be instantly available and language agnostic.
This seems like a good solution, but maybe I'm really missing the point here. Even at scale, tens of thousands of nodes can be served easily by a few configuration services, so I don't foresee any scaling issues. Again, it's just a hobby project, but I like thinking about the "what-if" scenarios :)
How are configuration schemes handled in microservices architecture? Does this seem like a reasonable approach? What do the major players like Facebook, Google, LinkedIn, AWS, etc... do?
Instead of building a custom configuration management solution, I would use one of these existing ones:
Spring Cloud Config
Spring Cloud Config is a config server written in Java offering an HTTP API to retrieve the configuration parameters of applications. Obviously, it ships with a Java client and a nice Spring integration, but as the server is just an HTTP API, you may use it with any language you like. The config server also features symmetric/asymmetric encryption of configuration values.
Configuration Source: The externalized configuration is stored in a Git repository, which must be made accessible to the Spring Cloud Config server. The properties in that repository are then accessible through the HTTP API, so you can even consider implementing an update process for configuration properties.
Server location: Ideally, you make your config server accessible through a domain (e.g. config.myapp.io), so you can implement load-balancing and fail-over scenarios as needed. Also, all you need to provide to all your services then is just that exact location (and some authentication / decryption info).
Getting started: You may have a look at the getting started guide for centralized configuration in the Spring docs, or read through the Quick Intro to Spring Cloud Config.
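Since the server is plain HTTP, any language can fetch its configuration. A minimal Java sketch (the host, application name, and profile below are illustrative assumptions; GET /{application}/{profile} is the server's standard endpoint format):

// Sketch: fetch configuration from a Spring Cloud Config server over HTTP.
// Host, application name, and profile are illustrative.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConfigFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://config.myapp.io/auth-service/production"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON containing the property sources
    }
}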
Netflix Archaius
Netflix Archaius is part of the Netflix OSS stack and "is a Java library that provides APIs to access and utilize properties that can change dynamically at runtime".
While limited to Java (which does not quite match the polyglot context you have described), the library is capable of using a database as the source for the configuration properties.
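Typical Archaius usage looks roughly like this (the property name and default are made up; the point is that the value is re-read at runtime when the backing source changes):

// Sketch of Netflix Archaius dynamic properties (property name is illustrative).
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

public class DbConfig {
    // The property is re-evaluated on each get(), so changes in the
    // configuration source are picked up at runtime without a restart.
    private static final DynamicStringProperty DB_HOST =
            DynamicPropertyFactory.getInstance()
                    .getStringProperty("db.host", "localhost");

    public static String dbHost() {
        return DB_HOST.get();
    }
}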
confd
confd keeps local configuration files up-to-date using data stored in external sources (etcd, consul, dynamodb, redis, vault, ...). After configuration changes, confd restarts the application so that it can pick up the updated configuration file.
In the context of your question, this might be worthwhile to try as confd makes no assumption about the application and requires no special client code. Most languages and frameworks support file-based configuration so confd should be fairly easy to add on top of existing microservices that currently use env variables and did not anticipate decentralized configuration management.
I don't have a good solution for you, but I can point out some issues for you to consider.
First, your applications will presumably need some bootstrap configuration that enables them to locate and connect to the configuration service. For example, you mentioned defining the configuration service API with IDL for a middleware system that supports remote procedure calls. I assume you mean something like CORBA IDL. This means your bootstrap configuration will not be just the endpoint to connect to (specified perhaps as a stringified IOR or a path/in/naming/service), but also a configuration file for the CORBA product you are using. You can't download that CORBA product's configuration file from the configuration service, because that would be a chicken-and-egg situation. So, instead, you end up with having to manually maintain a separate copy of the CORBA product's configuration file for each application instance.
Second, your pseudo-code example suggests that you will use a single RPC invocation to retrieve all the configuration for an application in a single go. This coarse level of granularity is good. If, instead, an application used a separate RPC call to retrieve each name=value pair, then you could suffer major scalability problems. To illustrate, let's assume an application has 100 name=value pairs in its configuration, so it needs to make 100 RPC calls to retrieve its configuration data. I can foresee the following scalability problems:
Each RPC might take, say, 1 millisecond round-trip time if the application and the configuration server are on the same local area network, so your application's start-up time is 1 millisecond for each of 100 RPC calls = 100 milliseconds = 0.1 second. That might seem acceptable. But if you now deploy another application instance on another continent with, say, a 50 millisecond round-trip latency, then the start-up time for that new application instance will be 100 RPC calls at 50 milliseconds latency per call = 5 seconds. Ouch!
The need to make only 100 RPC calls to retrieve configuration data assumes that the application will retrieve each name=value pair once and cache that information in, say, an instance variable of an object, and then later on access the name=value pair via that local cache. However, sooner or later somebody will call x = cfg.lookup("variable-name") from inside a for-loop, and this means the application will be making an RPC every time around the loop. Obviously, this will slow down that application instance, but if you end up with dozens or hundreds of application instances doing that, then your configuration service will be swamped with hundreds or thousands of requests per second, and it will become a centralised performance bottleneck.
You might start off writing long-lived applications that do 100 RPCs at start-up to retrieve configuration data, and then run for hours or days before terminating. Let's assume those applications are CORBA servers that other applications can communicate with via RPC. Sooner or later you might decide to write some command-line utilities to do things like: "ping" an application instance to see if it is running; "query" an application instance to get some status details; ask an application instance to gracefully terminate; and so on. Each of those command-line utilities is short-lived: when they start up, they use RPCs to obtain their configuration data, then do the "real" work by making a single RPC to a server process to ping/query/kill it, and then they terminate. Now somebody will write a UNIX shell script that calls those ping and query commands once per second for each of your dozens or hundreds of application instances. This seemingly innocuous shell script will be responsible for creating dozens or hundreds of short-lived processes per second, and each of those short-lived processes will make numerous RPC calls to the centralised configuration server to retrieve name=value pairs one at a time. That sort of shell script can put a massive load on your centralised configuration server.
I am not trying to discourage you from designing a centralised configuration server. The above points are just warnings about scalability issues you need to consider. Your plan for an application to retrieve all its configuration data via one coarse-granularity RPC call will certainly help you to avoid the kinds of scalability problems I mentioned above.
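To make the granularity advice concrete, here is a hedged sketch of the recommended shape in Java: one coarse-grained call at start-up, then purely local lookups, so a lookup inside a for-loop never touches the network (ConfigClient and its method are hypothetical):

// Sketch: fetch all configuration once, then serve lookups from a
// local in-memory map so loops never trigger network round-trips.
// ConfigClient and getConfiguration() are hypothetical.
import java.util.Map;

interface ConfigClient {
    // Returns every name=value pair for the named application in one call.
    Map<String, String> getConfiguration(String appName);
}

public class CachedConfig {
    private final Map<String, String> cache;

    public CachedConfig(ConfigClient client, String appName) {
        // One coarse-grained RPC at start-up.
        this.cache = client.getConfiguration(appName);
    }

    // Local lookup: safe to call inside a tight loop.
    public String lookup(String name) {
        return cache.get(name);
    }
}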
To provide some food for thought, you might want to consider a different approach. You could store each application's configuration files on a web server. A shell start script "wrapper" for an application can do the following:
Use wget or curl to download "template" configuration files from the web server and store the files on the local file system. A "template" configuration file is a normal configuration file but with some placeholders for values. A placeholder might look like ${host_name}.
Also use wget or curl to download a file containing search-and-replace pairs, such as ${host_name}=host42.pizza.com.
Perform a global search-and-replace of those search-and-replace terms on all the downloaded template configuration files to produce the configuration files that are ready to use. You might use UNIX shell tools like sed or a scripting language to perform this global search-and-replace. Alternatively, you could use a templating engine like Apache Velocity.
Execute the actual application, using a command-line argument to specify the path/to/downloaded/config/files.
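If you'd rather avoid shell, the same wrapper logic is a few lines of Java (the URLs, file paths, and placeholder value are illustrative; the ${host_name} syntax comes from the steps above):

// Sketch: download a template config file and substitute placeholders
// like ${host_name}. URLs and file paths are illustrative.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class ConfigTemplater {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String template = http.send(
                HttpRequest.newBuilder(URI.create("https://config-host/app.conf.template")).build(),
                HttpResponse.BodyHandlers.ofString()).body();

        // Search-and-replace pairs; in practice these would also be downloaded.
        Map<String, String> values = Map.of("${host_name}", "host42.pizza.com");
        for (Map.Entry<String, String> e : values.entrySet()) {
            template = template.replace(e.getKey(), e.getValue());
        }
        Files.writeString(Path.of("app.conf"), template);
        // The wrapper would now exec the real application, passing it
        // the path to the rendered configuration file.
    }
}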

How to manage server-side processes using MySQL

I have a perl script which takes in unique parameters (one of the parameters being --user=username_here). Users can start these processes using a web interface I am developing.
A MySQL table, transactions, keeps track of users that run the perl script:
id user script_parameters execute last_modified
23 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
24 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
25 alex --user=alex --keywords=lg 0 2014-05-06 05:49:01
26 alex --user=alex --keywords=lg 0 2014-04-30 04:31:39
The execute value for a given row will be "1" if the process should be running. It is set to "0" if the process should be ended.
My perl script constantly checks this value to make sure it's not "0" and if it is, the perl script terminates.
However, I need to manage these processes to protect against this problem:
What if my server abruptly crashes and restarts, OR the script crashes? I will need something running in the background, reading the transactions table and making sure it restarts the perl script as many times as needed, using the appropriate parameters.
And so, I'm having trouble figuring out how to balance giving users control to manage their own transaction(s), while also making sure that the transactions that SHOULD be running ARE running, and those that shouldn't be aren't.
Hope that makes sense and I appreciate any help!
It seems you're trying to launch long-running processes from a web server and then track those processes in a database. That's not impossible, but not a recommended practice.
The main problem is that an HTTP request needs to be actively handled by your web server for you to actually do anything (including tracking processes running on the system) -- you need something that can run all the time...
Instead, a better idea would be to have another daemonized "manager" process (as you mention perl, that'd be a good language to write it in) spawn & track the long running tasks (by PID and signals), and for that process to update your SQL database.
You can then have your "manager" process listen for requests from your web server to start a new process. There are various IPC mechanisms you could use (e.g. signals, SysV shm, Unix domain sockets, in-process queues like ZeroMQ, etc.).
This has multiple benefits:
If your spawned scripts need to run with user/group based isolation (either from the system or each other), then your webserver doesn't need to run as root, nor be setgid.
If a spawned process "crashes", a signal will be delivered to the "manager" process, so it can track mis-executions without issues.
If you use in-process queues (e.g. ZeroMQ) to deliver requests to the "manager" process, it can "throttle" requests from the web server (so that users cannot intentionally or accidentally cause a DoS).
Whether or not the spawned process ends well, you don't need an 'active' HTTP request to the web server in order to update your tracking database.
As to whether something that should be running is running, that's really up to your semantics. (i.e: is it based on a known run time? based on data consumed? etc).
The check as to whether it is running can be two-fold:
The "manager" process updates the database as appropriate, including the spawned PID.
Your web-server-hosted code can actually list processes to determine whether the PID in the database is actually running, and even how long it has been doing something useful (a small sketch of the PID check follows below).
The check for whether it is not running would have to be based on convention:
Name the spawned processes something you can predict.
Get a process list to determine what's still running (defunct?) that shouldn't be.
In either case, you could either inform the users who requested the processes be spawned and/or actually do something about it.
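For instance, checking whether a PID recorded in the database is still alive is a one-liner in modern Java (Perl's kill 0, $pid does the same); the PID value here is a placeholder for whatever the manager stored:

// Sketch: check whether the PID stored in the database is still running.
import java.util.Optional;

public class PidCheck {
    public static boolean isRunning(long pid) {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        return handle.map(ProcessHandle::isAlive).orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(isRunning(12345)); // 12345 is a placeholder PID
    }
}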
One approach might be to have a cron job which reads from the SQL database and runs ps to determine which spawned processes need to be restarted, and then re-requests that the "manager" process do so using the same IPC mechanism used by the web server. How you differentiate starts vs. restarts in your tracking/monitoring/logging is up to you.
If the server itself loses power or crashes, then you could have the "manager" process perform cleanup when it first runs, e.g:
Look for entries in the database for spawned processes that were allegedly running before the server was shut down.
Check for those processes by PID and run time (this is important).
Either re-spawn the spawned processes that didn't complete, or store something in the database to indicate to the web server that this was the case.
Update #1
Per your comment, here are some pointers to get started:
You mentioned perl, so presuming you have some proficiency there -- here are some perl modules to help you on your way to writing the "manager" process script:
If you're not already familiar with it, CPAN is the repository for perl modules that do basically anything.
Daemon::Daemonize - To daemonize a process so that it will continue running after you log out. Also provides methods for writing scripts to start/stop/restart the daemon.
Proc::Spawn - Helps with 'spawning' child scripts. Basically does fork() then exec(), but also handles STDIN/STDOUT/STDERR (or even the tty) of the child process. You could use this to launch your long-running perl scripts.
If your web server front-end code is not already written in perl, you'll need something that's pretty portable for inter-process message-passing and queuing; I'd probably make your web server front end in something easy to deploy (like PHP).
Here are two possibilities (there are many more):
Perl and PHP implementations for the Spread Toolkit.
Perl and PHP implementations for the ZeroMQ library.
Proc::ProcessTable - You can use this to check on running processes (and get all sorts of stats as discussed above).
Time::HiRes - Use the high-granularity time functions from this package to implement your 'throttling' framework. Basically just limit the number of requests you de-queue per unit of time.
DBI (with mysql) - Update your MySQL database from the "manager" process.

How to solve batch processing in ESB?

We have a legacy system that produces files, each containing hundreds of messages (financial transactions). We need to transform these messages into another format and submit them (individually) to a target system. The question is:
Should the ESB accept these files for processing directly, or should there be an adapter application between the legacy system and the ESB that would split received files into individual messages and let the ESB process the messages individually (instead of processing the whole file)?
In the first solution we expect two ESB flows. The first one would transform the file into a new format, split it into the messages, and store these messages into a temporary location. The transformation needs to process the file as a whole, because the file contains some common sections that are needed for transformation of the individual messages.
The second flow would take the individual transformed messages (each in a separate DB transaction), pass them to the target system, and wait for its answer (synchronously or asynchronously).
The second solution would replace the first flow by an external application that would transform the file, split it into individual transformed messages, and store them in a temporary location (local file system). The second flow would stay in the ESB.
In our eyes, the disadvantage of the first solution is in that the ESB would have to process huge files (in the first flow), which is commonly considered an antipattern. On the other hand, the ESB would adjust directly to the interface of the legacy system, which is one of the purposes of ESB.
In the second solution, the adapter application would contain the transformation logic, which should be another of the purposes and responsibilities of ESB.
What is the commonly suggested solution for this situation (a pattern)? Could you provide some references that are more descriptive than these two links that I've found?
http://publib.boulder.ibm.com/infocenter/esbsoa/wesbv7r5/index.jsp?topic=%2Fcom.ibm.websphere.wesb.programming.doc%2Ftopics%2Fesbprog_patterns.html
https://www.ibm.com/developerworks/wikis/display/esbpatterns/File+Processing
Edit
Another reference:
http://www.ibm.com/developerworks/webservices/library/ws-largemessaging/
Remember that there are 3 message types in SOA: Command, Event, Document
That 'Document' bit is for chunks of data. It is probably better suited to 'real' document types such as 'Order' or 'Invoice' and the like but there is nothing stopping you from going with 'TransactionBatch'.
That being said, it is a rather unused message type in that not many service buses actually implement anything around it, since:
you do not really need it
many message queuing technologies have limits on message size (as low as 4kb) making it difficult to transport any large message (needs to be sent in chunks)
So what I would do in your scenario is have an endpoint that processes the file: something like a ProcessTransactionFileCommand sent to the processing endpoint, containing only a reference to the actual file (stored somewhere in the file system, or even a URL to download from). That processing endpoint can process the file and send the individual messages (all within a transaction) to the integration endpoint, which sends each message off to the external system. You could have a SendTransactionCommand to do that.
In this way your system is very flexible, in that the integration endpoint can receive individual integration commands from some parts of your solution, while the processing endpoint can handle the batch and split it into individual integration commands.
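To sketch the shape of this in Java (the command names come from the answer above; the Bus interface and everything else are hypothetical, standing in for whatever service bus you pick):

// Sketch: a processing endpoint splits a batch file into individual
// integration commands. Types and the bus interface are hypothetical;
// any service bus (Shuttle, MassTransit, NServiceBus, ...) plays this role.
import java.util.List;

record ProcessTransactionFileCommand(String fileLocation) {}
record SendTransactionCommand(String transformedMessage) {}

interface Bus {
    void send(Object command);
}

class ProcessingEndpoint {
    private final Bus bus;

    ProcessingEndpoint(Bus bus) { this.bus = bus; }

    void handle(ProcessTransactionFileCommand command) {
        // Transform the whole file (the common sections are needed to
        // transform the individual messages), then send one integration
        // command per transaction. Transactional sending is bus-specific
        // and elided here.
        List<String> messages = transformAndSplit(command.fileLocation());
        for (String message : messages) {
            bus.send(new SendTransactionCommand(message));
        }
    }

    private List<String> transformAndSplit(String fileLocation) {
        // File parsing and transformation elided for brevity.
        return List.of();
    }
}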
Should you be in the .NET space, you may want to look at my FOSS service bus project: http://shuttle.codeplex.com/
But any service bus will do the trick (MassTransit, NServiceBus, etc.)
You can use an ESB for the first case, and I don't think it would be an anti-pattern. One purpose of an ESB is precisely to integrate legacy applications that produce files as output, as in your use case, with other applications.
You can try Mule ESB. It will allow you to consume the file using streaming (through the file transport), map the content of the file to your desired output using a GUI called DataMapper, and finally put those messages in a VM queue, which can be a persistent queue within the ESB. These queues are transactional, so you can guarantee that either all the messages created from one file were put on the VM queue or none of them were.
Then you can, from another flow (in fact, processing units within the ESB are called flows in Mule), read each of those messages and process them.
HTH, Pablo.

Can MS Enterprise Library Logging be used for multiple applications?

I'm wondering if it's (a) possible and (b) good practice to log multiple applications to a single log instance.
I have several ASP.NET apps and I would like to aggregate all exceptions to a centralized location that can be queried as part of an Enterprise Dashboard app. I'm using both the EL logging block and the EL exception block along with the Database Trace Listener. I would like to see exceptions across all apps logged to a single DB.
Any comments, best practice guidelines or answers would be extremely welcome.
Yes, it is definitely possible to store multiple application logs in a central location using EL.
An Enterprise Dashboard application that lets you view exceptions across applications and tiers, and provides reporting is a great reason to centralize your logging. So I'll say yes to question b as well.
Possible Issues/Negatives
I'm assuming that you are using the Database Trace Listener, since you mention that in your question. If a large number of applications log a large number of entries while users are also querying the (potentially large) log database, performance can degrade (since the logging is done synchronously), which could impact your applications.
Another Approach
To mitigate against this possibility, I would investigate using the Distributor Service to log asynchronously. In that model, all of the applications would log to a message queue (using the MSMQ Trace Listener). A separate service then polls the queue and would forward the log entries to a trace listener (in your case a Database Trace Listener) which would persist the messages in your dashboard database. This setup is more complicated. But it does seem to align with what you are trying to achieve and has some other benefits such as asynchronous processing and the ability to log even if the dashboard database is down (e.g. for maintenance).
Other Considerations
You may also want to think about standardizing some LogEntry properties across applications. For example, LogEntry doesn't really have an "application" property so you could add an ExtendedProperty to represent the application name. Or you may standardize on a specific format for the Message property so that various information can be pulled out of the message and stored in separate database columns for easier searching and categorization.