We have often run into problems with custom Transaction Processors: when the TP crashes or is unable to connect to the Sawtooth nodes, we get a QUEUE_FULL error, and from then on all transactions go into the PENDING state, including intkey and settings transactions.
Is there a way to remove PENDING transactions and clean up the queue, or a CLI command that can clean up the batches and transactions that are in the queue?
The Hyperledger Sawtooth validator attempts to execute transactions in the order they arrive, whenever the consensus engine calls for it. The question covers two distinct features; happy to help further.
Feature 1: Handling a Transaction Processor crash. The Transaction Processor is expected to execute the transactions in the queue when the consensus engine asks the validator to build a block. If for some reason the Transaction Processor is unable to process a message, the result is still unknown to the validator, so the validator keeps the transaction in the pending state for as long as it can be scheduled for execution. The right way to de-queue a transaction is to execute it: it is either put in a block if it is valid or removed from the queue if it is invalid.
Solution: Find out why the Transaction Processor is crashing. This is code you own. The validator expects one of the following responses: the transaction is valid, the transaction is invalid, or the transaction couldn't be evaluated and needs a retry.
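The three responses can be sketched roughly as follows, in Python. This is an illustration only: the two exception classes are local stand-ins named after `InvalidTransaction` and `InternalError` from the Sawtooth Python SDK so that the snippet runs without the SDK installed, and `apply_transaction`, the JSON payload format and the `state` dict are hypothetical.

```python
import json


class InvalidTransaction(Exception):
    """Stand-in: the transaction is malformed or breaks a business rule."""


class InternalError(Exception):
    """Stand-in: a transient fault; the transaction should be retried."""


def apply_transaction(payload: bytes, state: dict) -> None:
    """Hypothetical handler body.

    Returns normally when the transaction is valid, raises
    InvalidTransaction when it can never succeed, and raises
    InternalError when it could not be evaluated this time.
    """
    try:
        data = json.loads(payload.decode("utf-8"))
        name, value = data["name"], int(data["value"])
    except (ValueError, KeyError, TypeError) as err:
        # Bad input will never become valid, so reject it outright.
        raise InvalidTransaction(f"bad payload: {err}") from err

    if value < 0:
        raise InvalidTransaction("value must be non-negative")

    try:
        state[name] = state.get(name, 0) + value
    except Exception as err:
        # Anything unexpected while touching state is reported as retryable
        # instead of being allowed to crash the processor.
        raise InternalError(f"state update failed: {err}") from err
```

The point of the sketch is that every code path ends in one of the three outcomes, so the processor never leaves the validator waiting for an answer.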
Feature 2: Removing pending batches from the queue deliberately, without telling Hyperledger Sawtooth about it. The pending queue is held in memory; it is not saved to disk. The crazy solution is therefore to restart that particular validator node.
Note: This may not be possible in certain cases because of the deployment model chosen. Ensure your network and deployment can handle node restarts before doing it. There could be bad consequences if the TP crashed on one node rather than on all of them: that validator may send a wrong result to consensus, and how the error is handled depends on the consensus algorithm and the network size. The clean solution, however, is to restart the Transaction Processor.
Hope this answer helps! Happy blockchaining..
I'm trying to understand how partitions process events when a retry policy is in place for the event hub, and I can't find an answer to this: what happens to new events in a partition when an earlier event hit an error and is being retried in that same partition?
I'm guessing that the event that got an error shouldn't block new ones from executing, and that when it retries it should be put at the end of the partition, so any events that arrived in the partition after the error should be executed in order without being blocked.
Can someone explain what is actually happening in a scenario like that?
Thanks.
It's difficult to answer precisely without some understanding of the application context. The below assumes the current generation of the Azure SDK for .NET, though conceptually the answer will be similar for others.
Retries during publishing are performed within the client, which treats each publishing operation as independent and isolated. When your application calls SendAsync, the client will attempt to publish the events and will apply its retry policy in the scope of that call. When the SendAsync call completes, you'll have a deterministic answer of whether the call succeeded or failed.
If the SendAsync call throws, the retry policy has already been applied and either the exception was fatal or all retries were exhausted. The operation is complete and the client is no longer trying to publish those events.
If your application makes a single SendAsync call then, in the majority of cases, it will understand the outcome of the publishing operation and the order of events is preserved. If your application is calling SendAsync concurrently, then it is possible that events will arrive out of order - either due to network latency or retries.
While the majority of the time, the outcome of a call is fully deterministic, some corner cases do exist. For example, if the SendAsync call encounters a timeout, it is ambiguous whether or not the service received the events. The client will retry, which may produce duplicates. If your application sees a TimeoutException surface, then it cannot be sure whether or not the events were successfully published.
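The timeout corner case can be illustrated with a small simulation. This is a language-neutral sketch in Python, not the Azure SDK: `SimulatedService` and `send_with_retry` are hypothetical names, and the "service" is just an in-memory list that stores the batch but loses the acknowledgement.

```python
class SimulatedService:
    """Hypothetical stand-in for the service: it stores every batch it is
    sent, but the first `lost_acks` acknowledgements never reach the caller."""

    def __init__(self, lost_acks: int = 1):
        self.received = []
        self._lost_acks = lost_acks

    def publish(self, batch):
        self.received.extend(batch)  # the service did store the events
        if self._lost_acks > 0:
            self._lost_acks -= 1
            raise TimeoutError("no acknowledgement received")


def send_with_retry(service, batch, max_retries=3):
    """Apply the retry policy within the scope of a single send call."""
    for attempt in range(max_retries + 1):
        try:
            service.publish(batch)
            return
        except TimeoutError:
            if attempt == max_retries:
                # Retries exhausted: the caller cannot tell whether the
                # events were stored or not.
                raise


service = SimulatedService(lost_acks=1)
send_with_retry(service, ["a", "b"])
print(service.received)  # ['a', 'b', 'a', 'b']
```

The call returns successfully, yet the batch was stored twice, which is why consumers of such a stream usually need to tolerate duplicates.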
Required starts a new transaction while Supported joins an existing transaction.
However, if a transaction does not already exist, then does the Supported option create a new transaction?
This MSDN link suggests that it does not, whereas this Microsoft training video at 36:36 says that it does.
The MSDN documentation and the video are consistent:
Required: Make a transaction
Supported: Enlist in an available transaction
NotSupported: Ignore any available transaction
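As a rough model of those three options (a Python sketch of the documented behaviour, not an SSIS API; `TransactionOption` here is a local enum and `resolve` is a hypothetical helper), the outcome depends only on the option and on whether the parent container already has a transaction:

```python
from enum import Enum


class TransactionOption(Enum):
    REQUIRED = "Required"
    SUPPORTED = "Supported"
    NOT_SUPPORTED = "NotSupported"


def resolve(option: TransactionOption, parent_has_transaction: bool) -> str:
    """Describe what a container does for a given option and parent state."""
    if option is TransactionOption.REQUIRED:
        # Joins the parent's transaction, or starts one if there is none.
        return "join parent" if parent_has_transaction else "start new"
    if option is TransactionOption.SUPPORTED:
        # Joins the parent's transaction, but never starts its own.
        return "join parent" if parent_has_transaction else "no transaction"
    # NotSupported ignores any transaction the parent may have.
    return "no transaction"


print(resolve(TransactionOption.SUPPORTED, parent_has_transaction=False))
# no transaction
```

The last call is the case asked about: with no existing transaction, Supported does not create one.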
At the 36 minute mark, the video is discussing SSIS Checkpoints, which are more like a bookmark for the package. They record the last executed step for a run. For anything but the most trivial of packages, I advise against using SSIS Checkpoints as they're flaky, unreliable and cantankerous.
Instead, design your packages with restartability in mind. Task X fails - how does your package deal with it if it is restarted? Can it clean up any hanging/incomplete work? Can it identify the work has been done and skip it/perform no work?
The comments indicate:
The slide at 36:36 reads as: "Supported joins an existing transaction or starts a new one". So is this correct or not?
The slide is incorrect. If you don't believe the people that wrote the documentation, read the entirety of the internet on the topic and you'll discover everyone saying the same thing. Either this youtuber is a savant or they are wrong. You can evaluate the truthfulness of my answer and everyone else's by firing up the Distributed Transaction Coordinator (DTC) and watching as the package runs under the Supported and Required transaction options. You'll be able to observe that DTC has work to do under Required and none under Supported/NotSupported.
https://www.mssqltips.com/sqlservertip/1585/how-to-use-transactions-in-sql-server-integration-services-ssis/
https://sqlblogging.com/2011/10/17/transactions-in-ssis-with-example/
TransactionOption in SSIS
https://microsoft-ssis.blogspot.com/2011/01/ssis-transactions.html
https://social.msdn.microsoft.com/Forums/en-US/89738285-d797-4b09-b618-7bf51cc6228c/ssis-transaction-option
https://sqlstudies.com/2016/01/06/msdtc-requirements-for-ssis-transactions/
I have a strange issue which is causing a serious double-booking problem for us.
We have MQ .NET code written in C# running on a Windows box that has MQ Client v7.5. The code places messages on an MQ queue. Once in a while the "put" operation works and the message is placed on the queue, but an MQException is still thrown with error code 2009.
In this case, the program assumes that the put operation failed and places the same message on the queue again, which is not a desirable scenario. The assumption is that if the "put" resulted in an MQException, the operation has failed. Any idea how to avoid this issue? See the client code below.
queue = queueManager.AccessQueue(queueName, MQC.MQOO_OUTPUT + MQC.MQOO_FAIL_IF_QUIESCING);
queueMessage = new MQMessage();
queueMessage.CharacterSet = 1208;
var utf8Enc = new UTF8Encoding();
byte[] utf8String = Encoding.UTF8.GetBytes(strInputMsg);
queueMessage.WriteBytes(Encoding.UTF8.GetString(utf8String).ToString());
queuePutMessageOptions = new MQPutMessageOptions();
queue.Put(queueMessage, queuePutMessageOptions);
Exception:
MQ Reason code: 2009, Exception: Error in the application.
StackTrace: at IBM.WMQ.MQBase.throwNewMQException()
at IBM.WMQ.MQDestination.Open(MQObjectDescriptor od)
at IBM.WMQ.MQQueue..ctor(MQQueueManager qMgr, String queueName, Int32 openOptions, String queueManagerName, String dynamicQueueName, String alternateUserId)
at IBM.WMQ.MQQueueManager.AccessQueue(String queueName, Int32 openOptions, String queueManagerName, String dynamicQueueName, String alternateUserId)
at IBM.WMQ.MQQueueManager.AccessQueue(String queueName, Int32 openOptions)
There is always an ambiguity of outcomes when using any async messaging over the network. Consider the following steps in the API call:
1. The client sends the API call to the server.
2. The server executes the API call.
3. The result is returned to the client.
Let's say the connection is lost prior to or during #1 above. The application gets the 2009 and the message is never sent.
But what if the connection is lost after #1? The outcome of #2 cannot possibly be returned to the calling application. Whether the PUT succeeded or failed, it always gets back a 2009. Maybe the message was sent and maybe it wasn't. The application probably should take the conservative option, assume it wasn't sent, then resend it. This results in duplicate messages.
Worse is if the application is getting the message. When the channel agent successfully gets the message and can't return it to the client then that message is irretrievably lost. Since the application didn't specify syncpoint, it wasn't MQ that lost the message but rather the application.
This is intrinsic to all types of async messaging. So much so that the JMS 1.1 specification specifically addresses it in 4.4.13 Duplicate Production of Messages which states that:
If a failure occurs between the time a client commits its work on a
Session and the commit method returns, the client cannot determine if
the transaction was committed or rolled back. The same ambiguity
exists when a failure occurs between the non-transactional send of a
PERSISTENT message and the return from the sending method.
It is up to a JMS application to deal with this ambiguity. In some
cases, this may cause a client to produce functionally duplicate
messages.
A message that is redelivered due to session recovery is not
considered a duplicate message.
This can be addressed in part by using syncpoint. Any PUT or GET under syncpoint will be rolled back if the call fails. The application can safely assume that it needs to PUT or GET the message again and no dupes or lost messages will result.
However, there is still the possibility that 2009 will be returned on the COMMIT. At this point you do not know whether the transaction completed or not. If it is 2-phase commit (XA) the transaction manager will reconcile the outcome correctly. But if it is 1-Phase commit, then you are back to not knowing whether the call succeeded or failed.
In the case that the app got a message under syncpoint, it will at least have either been processed or rolled back. This completely eliminates the possibility of losing persistent messages due to ambiguous outcomes. However if the app received a message and gets 2009 on the COMMIT then it may receive the same message again, depending on whether the connection failure occurred in #1 or #3 in the steps above. Similarly, a 2009 when committing a PUT can only be dealt with by retrying the PUT. This also potentially results in dupe messages.
So, short of using XA, any async messaging faces the possibility of duplicate messages due to connection exception and recovery. TCP/IP has become so reliable since MQ was invented that most applications ignore this architectural constraint without detrimental effects. Although that increased reliability in the network makes it less risky to design apps that don't gracefully handle dupes, it doesn't actually address the underlying architectural constraint. That can only be done in code, XA being one example of that. Many apps are written to gracefully handle dupe messages and do not need XA to address this problem.
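One common way to handle dupes gracefully in code is an idempotent consumer that remembers a business key for each message it has processed. The following is a minimal Python sketch under stated assumptions, not MQ client code: `IdempotentConsumer` and the `booking_id` field are hypothetical, and the in-memory set stands in for durable storage that would be updated in the same unit of work as the handler's side effects.

```python
class IdempotentConsumer:
    """Process each business key at most once, ignoring redelivered copies."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: durable storage in a real system

    def on_message(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a dupe."""
        key = message["booking_id"]  # hypothetical unique business key
        if key in self._seen:
            return False
        self._handler(message)
        # Record the key only after the handler succeeded, so a failed
        # attempt can be retried later.
        self._seen.add(key)
        return True


bookings = []
consumer = IdempotentConsumer(bookings.append)
consumer.on_message({"booking_id": 17, "seat": "12A"})
consumer.on_message({"booking_id": 17, "seat": "12A"})  # resent after a 2009
print(len(bookings))  # 1
```

With this in place the sender can take the conservative option and resend after an ambiguous failure without causing a double booking.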
Note: Paul Clarke (the guy who wrote much of the MQ channel code) is quick to point out that the ambiguity exists even when using bindings mode connections. In 20 years of using WMQ I have yet to see a 2009 on a bindings mode connection, but he says the shorter path to the QMgr doesn't eliminate the underlying architectural constraint any more than the reliable network does.
We have a main orchestration that has multiple sub-orchestrations. All root orchestrations are of transaction type None, hence all the sub-orchestrations are of the same nature. Any exception is caught in a parent scope of the main orchestration, where we have some steps such as logging. The orchestration is activated by a message from App SQL. So every time an exception occurs, say due to something intermittent like being unable to connect to a web service, we later go and re-trigger it manually.
I'm looking at modifying the orchestration to be self-healing: say, from the exception catch block it re-initializes the messages based on conditions that tell us the issue was intermittent. For something like an application issue, such as a null reference, we would not want to resend the message, because the orchestration is never going to work.
There is a concept called compensation, but that is for transaction-based orchestrations: do n steps and, if any one fails, do m other steps (which would perform an alternate action or cleanup).
The only idea I have is to do a lookup based on keywords in the exception and decide whether to resend messages. But I want someone to challenge this or suggest a better approach.
I have always thought that it's better to handle failures offline. So if the orchestration fails, terminate it. But before you terminate, send a message out. This message will contain all the information necessary to recover the message processing if it turns out that there was a temporary problem which caused the failure. The message can be consumed by a "caretaker" process which is responsible for recovery.
This is similar to how the Erlang OTP framework approaches high availability. Processes fail quickly and caretaker processes make sure recovery happens.
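A rough sketch of that fail-fast plus caretaker split, in Python instead of BizTalk. Everything here is hypothetical: `FailureNotice`, `run_orchestration`, `caretaker_decision` and the choice of which exception types count as transient are assumptions made for illustration. It classifies by exception type where the question proposed keyword matching, which is one possible variation and not something the answer prescribes.

```python
from dataclasses import dataclass

# Assumption: these exception types indicate an intermittent problem.
TRANSIENT = (ConnectionError, TimeoutError)


@dataclass
class FailureNotice:
    """Message sent out before the orchestration terminates."""
    original_message: dict
    error_type: str
    transient: bool
    attempts: int


def run_orchestration(message, process, attempts=0):
    """Run once; on failure return a notice instead of retrying inline."""
    try:
        process(message)
        return None
    except Exception as err:
        return FailureNotice(
            original_message=message,
            error_type=type(err).__name__,
            transient=isinstance(err, TRANSIENT),
            attempts=attempts + 1,
        )


def caretaker_decision(notice: FailureNotice, max_attempts: int = 3) -> str:
    """Decide offline whether the original message should be resubmitted."""
    if notice.transient and notice.attempts < max_attempts:
        return "resubmit"
    return "escalate"  # e.g. a null reference will never succeed on retry
```

The orchestration itself stays simple because it only reports what happened; the retry policy lives in one place and can be changed without touching the orchestration.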
We had a case where exceptions went into some kind of infinite loop.
The stack traces were very big and we log all of them.
That flooded our Oracle database, and when the redo logs reached their size limit the DB stopped.
EDIT: Of course, the most important thing is to find the cause of the infinite loop and correct the bug in the system. We already did that, and that is not the question here.
The system could have more bugs like that (it's a Windows service and it's running constantly), and in this case one app broke the whole DB, meaning all applications on that Oracle DB.
I'm mostly interested in your experiences, architecturally, and in how other logging frameworks like log4net, log4j and others deal with this. How do they handle a flood of exceptions? Do they just handle them like all other exceptions?
I think your situation illustrates that there should definitely be some mechanism in place to prevent exception logs from causing a denial-of-service anywhere, as this has done.
If you use the Windows event logs, this can be handled for you automatically, as old records can automatically be wiped out when the log is full. You could code a DB-based system to do the same thing, as well.
Of course, you want to do everything you can to eliminate such errors in the first place wherever possible, too!
Another option may be to detect and ignore multiple, consecutive errors of the same type... perhaps simply updating a count property/field instead.
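A minimal sketch of that idea using Python's standard `logging` module (not log4net or log4j; `ConsecutiveDuplicateFilter` is a hypothetical name). It drops a record whose rendered message equals the previous one and keeps a count, which is reported on the next different message:

```python
import logging


class ConsecutiveDuplicateFilter(logging.Filter):
    """Suppress consecutive identical messages and count them instead."""

    def __init__(self):
        super().__init__()
        self._last = None
        self.suppressed = 0

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        if message == self._last:
            self.suppressed += 1
            return False  # drop the repeat instead of writing it again
        if self.suppressed:
            # Report how many repeats were dropped before this record.
            record.msg = (
                f"(previous message repeated {self.suppressed} more times) "
                f"{message}"
            )
            record.args = ()
        self._last = message
        self.suppressed = 0
        return True
```

Attached to a logger with `logger.addFilter(ConsecutiveDuplicateFilter())`, an error raised in a tight loop is written once, so the log store sees a counter rather than thousands of identical stack traces.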
I'd worry more about the root cause of the infinite loop than I would about limiting logging.
I'd check your code for methods that catch an exception, log the stack trace, and re-throw. I'd argue that catching and re-throwing is not exception handling. If a class truly can't handle the exception, it's better to let it simply bubble up until it reaches a single point where someone can deal with it.
Redo logs? How often do you flush those? Surely you don't have one big transaction, do you?
Can you do the logging to a different database with no redo logs? That will protect the production database.
In our applications we have a central exception handler that all exceptions go through:
void OnExceptionOccurs(Exception ex,
string enduserFriendlyContextDescription,
string tecnicalContextDescription,
ILogger loggerBelongingToProcess)
That handler can decide how to log, and you have a central location for a breakpoint when debugging.