We have main orchestration that has multiple sub orchestration. All root orchestration is of transaction type:none, hence all the sub are also of same nature. Now any exception is caught in a parent scope of main orchestration and we have some steps like logging. The orchestration is activated with a message from App SQL. So every time an exception occurs, say due to something intermittent, like unable to connect to web service. We later go manually re-trigger.
I'm looking at modifying the orch to be self healing, say from exception catch block it reinitialize the messages based on conditions that tell, the issue was intermittent. Something like app issue-null reference, we would not want to resend message, because, the orch is never going to work.
There is a concept called compensation, but that is for transaction based orch- do n steps if any 1 fails, do m other steps(which would do alternate action or cleanup).
The only idea I have is do a look-up based on keywords in exception and decide to resend messages. But I want some1 to challenge this or suggest a better approach
I have always thought that it's better to handle failures offline. So if the orchestration fails, terminate it. But before you terminate, send a message out. This message will contain all the information necessary to recover the message processing if it turns out that there was a temporary problem which caused the failure. The message can be consumed by a "caretaker" process which is responsible for recovery.
This is similar to how the Erlang OTP framework approaches high availability. Processes fail quickly and caretaker processes make sure recovery happens.
Related
I'm trying to understand how the partitions are executing the events when there is retry policy in place for the event hub and I can't find an answer to what happens to new events when one got an error and is retrying in the same partition in the event hub?
I'm guessing that the one that got an error shouldn't block new ones from executing and when it reties it should be put at the end of the partition, so any other events that got in the partition after that event got an error should be executed in order without any blockage.
Can someone explain what is actually happening in a scenario like that?
Thanks.
It's difficult to answer precisely without some understanding of the application context. The below assumes the current generation of the Azure SDK for .NET, though conceptually the answer will be similar for others.
Retries during publishing are performed within the client, which treats each publishing operation an independent and isolated. When your application calls SendAsync, the client will attempt to publish them and will apply its retry policy in the scope of that call. When the SendAsync call completes, you'll have a deterministic answer of whether the call succeeded or failed.
If the SendAsync call throws, the retry policy has already been applied and either the exception was fatal or all retries were exhausted. The operation is complete and the client is no longer trying to publish those events.
If your application makes a single SendAsync call then, in the majority of cases, it will understand the outcome of the publishing operation and the order of events is preserved. If your application is calling SendAsync concurrently, then it is possible that events will arrive out of order - either due to network latency or retries.
While the majority of the time, the outcome of a call is fully deterministic, some corner cases do exist. For example, if the SendAsync call encounters a timeout, it is ambiguous whether or not the service received the events. The client will retry, which may produce duplicates. If your application sees a TimeoutException surface, then it cannot be sure whether or not the events were successfully published.
I've been trying to read more about what to do properly catching / handling exceptions, but I don't think I've got it down. In fact, I think I'm getting much more confused and possibly implementing bad code. I don't want to do that.
An example setup that I have been using:
Mobile device makes a call to the WCF Service.
WCF Service retrieves the data from the database, and if any errors occur on the database level, they are logged and I am sent an e-mail.
WCF Service sends data (or a brief description of the exception) to the mobile device.
The mobile device processes the data, and if any error occurs, throws the error up to the UI layer.
For a few of the exceptions, I created custom ones - service exception, authorization exception, so I can properly notify the user. If the service encountered an error or an IOException occurs, the user will be notified that 'the data could not be retrieved.'
If, however, another error occurs - such as a JSON error, or anything like that 'just in case', the error is thrown to the UI layer and simply caught as Exception, since we don't really need to user to know what happened, but that an error occurred.
Is this appropriate exception handling?
Are you seeing any problems?
In general, it makes sense to have some sort of catch-all that allows the user to keep working. This should be combined with appropriate handling for any showstoppers, to let the user down gracefully, and catch anything else that would make proceeding dangerous.
"Appropriate exception handling" is always going to be a) application dependent and b) subjective - so there's no definitive answer.
In general I would say you need to do all of the following:
Specifically address and handle appropriately all likely exceptions.
Provided a catch all to prevent a non-graceful termination.
Notify the user of unexpected errors if there is potential it will effect
their data or usage (i.e. - don't mask errors that might impact user)
Sounds like you've done this so I believe you have a reasonable approach in place.
We had a case when exceptions had gone in some kind of infinite loop.
Stack traces were very big and we log all of them.
That flood our Oracle database and when redo logs reached their size limit db stopped.
EDIT: Of course that the most important thing is to find the cause of infinite loop an correct the bug in the system. We already did that and that is not the question here.
The system could have more bugs like that (it's an windows service and it's running constantly) and in that case one app broke the whole DB, meaning all applications on that Oracle DB.
I'm mostly interested in your experiences, architecturally. And that from other logging frameworks like log4net, log4j and others. How do they handle flood of exceptions ? Just handle them like all other exceptions ?
I think your situation illustrates that there should definitely be some mechanism in place to prevent exception logs from causing a denial-of-service anywhere, as this has done.
If you use the Windows event logs, this can be handled for you automatically, as old records can automatically be wiped out when the log is full. You could code a DB-based system to do the same thing, as well.
Of course, you want to do everything you can to eliminate such errors in the first place where ever possible, too!
Another option may be to detect and ignore multiple, consecutive errors of the same time... perhaps simply updating a count property/field instead.
I'd worry more about the root cause of the infinite loop then I would about limiting logging.
I'd check your code for methods that catch an exception, log the stack trace, and re-throw. I'd argue that catching and re-throwing is not exception handling. If a class truly can't handle the exception, it's better to let it simply bubble up until it reaches a single point where someone can deal with it.
Redo logs? How often do you flush those? Surely you don't have one big transaction, do you?
Can you do the logging to a different database with no redo logs? That will protect the production database.
In our applications whe have a central exceptionhandler where all execeptions go through
void OnExceptionOccurs(Exception ex,
string enduserFriendlyContextDescription,
string tecnicalContextDescription,
ILogger loggerBelongingToProcess)
that handler can decide how to log and you have a central location for breakpoint when debugging
please verify me if I am right: when a program encounters a exception we should write code to handle that exception, the handler should do some recovery job, like rerun the program, but is not just telling us where we went wrong in real world application.
When you throw an exception you're saying:
"Something happened and I can't handle it here. I'm passing the buck!"
When you catch you say:
"Ok I can do something. Perhaps loop around and try again; maybe log an error message".
You can even choose to ignore the error but that's usually discouraged.
In general the thrower captures the state of the failure and the catcher acts on the failure.
In real life, exceptions don't always make error handling easier; but it helps keeps errors out of the main line code path. It's an alternate execution flow. You often hear this: "Exceptions should be used only for exceptional things."
This is a controversial topic, so I expect some to disagree with what I'm going to say.
Exceptions are for exceptional circumstances, namely, the following two classes of problems:
Program Bug
External Problems
In the former case, your program has gotten into a state it shouldn't be in. So you throw an exception and catch it at a level high enough where the program can gracefully continue. In general, this should be fairly high in your program. The reality is, if there's a bug in the middle of an operation, there's not much you can do to recover (after all, if you knew there was a bug, you'd fix it!). Best is to log it, let the user know and move on, if possible. Terminate the current operation, dialog, whatever, or even the whole program.
In the latter case, you are dealing with capricious and fickle universe. Things might go wrong through no fault of your own. In this case, you should try to detect errors as close to the source as you can and deal with them as best you can. If you sending an email to a flaky server results in an exception, it might be reasonable to try again (warning the user). If the database connection goes down, you could try again, but it might be better to give up and kill the current operation. It depends on how well you understand the external problems that might arise and what can actually be done about them.
If you have known error conditions, such as errors in user input or other data sources (e.g., XML parse error, user picked wrong choice on a form, etc.), it's probably best not to throw an exception, but instead gather and report the errors in a more structured fashion. In one project of mine, I have an error reporter class that can gather errors without interrupting program flow. Then those errors can be reported to the user, or logged.
Often times, I think the best approach is not to catch the error, especially if you don't have a specific response for the error. In general, I think the approach of "catch and try again" is flawed. The cause should be identified and corrected. You shouldn't just keep ramming into a brick wall.
Exceptions should be thrown when, and only when, a method/property/whatever is unable to fulfill its contract. The only time an exception should be caught without either rethrowing it or wrapping it in a new exception and throwing that is when the method that caught the exception can fulfill its contract despite the inner method's failure to fulfill its contract. While it may be hard to determine the optimal contract for a routine, once the contract is in place, deciding whether to throw an exception will be easy: do what the contract says.
It really depends on the error and how you want to handle it. In a lot of my automation systems, if something goes wrong, I want the program to send an email out with a specific error and then terminate. Other times I want to catch the error and run a different process to back out data that I had previously entered.
Sounds like you have the general idea down.
What are the best practices for exceptions over remote methods?
I'm sure that you need to handle all exceptions at the level of a remote method implementation, because you need to log it on the server side. But what should you do afterwards?
Should you wrap the exception in a RemoteException (java) and throw it to the client? This would mean that the client would have to import all exceptions that could be thrown. Would it be better to throw a new custom exception with fewer details? Because the client won't need to know all the details of what went wrong. What should you log on the client? I've even heard of using return codes(for efficiency maybe?) to tell the caller about what happened.
The important thing to keep in mind, is that the client must be informed of what went wrong. A generic answer of "Something failed" or no notification at all is unacceptable. And what about runtime (unchecked) exceptions?
It seems like you want to be able to differentiate if the failure was due to a system failure (e.g. a service or machine is down) or a business logic failure (e.g. the user does not exist).
I'd recommend wrapping all system exceptions from the RMI call with your own custom exception. You can still maintain the information in the exception by passing it to your custom exception as the cause (this is possible in Java, not sure about other languages). That way client only need to know how to handle the one exception in the cause of system failure. Whether this custom exception is checked or runtime is up for debate (probably depends on your project standards). I would definitely log this type of failure.
Business type failures can be represented as either a separate exception or some type of default (or null) response object. I would attempt to recover (i.e. take some alternative action) from this type of failure and log only if the recovery fails.
In past projects we'd catch all service layer (tier) exceptions at the very top of the layer, passing the application specific error codes/information to the UI via DTO's/VO's. It's a simple approach in that there's an established pattern of all error handling happening in the same place for each service instead of scattered about the service and UI layers.
Then all the UI has to do is inspect the DTO/VO for a flag (hasError?) and display the error message(s), it doesn't have to know nor care what the actual exception was.
I would always log the exception within my application (at the server side as defined in your question).
I would then throw an exception, to be caught by the client. If the caller could take corrective action to prevent the exception then I would ensure that the exception contained this information (e.g. DateTime argName must not be in the past). If the error was caused by some outage of a third party system then I might pass this information up the call stack to the caller.
If, however, the exception was essentially caused by a bug in my system then I would structure my exception handling such that a non-informative exception message (e.g. General failure) was used.
Here's what I did. Every Remote Method implementation catches all Exceptions on the server side and logs them. Then they are wrapped in a Custom Exception, which will contain a description of the problem. This description must be useful to the client, so it won't contain all the details of the caught Exception, because the client doesn't need them. They have already been logged on the server side. Now, on the client, these Exceptions can be handled how the user wishes.
Why I chose using Exceptions and not return codes is because of one very important drawback of return codes: you can't throw them to higher levels without some effort. This means you have to check for an error right after the call and handle it there. But this may not be what I want.