I'm looking for a way to create a Zabbix monitor that will only alert when we have a bunch of other alerts on board.
For example, I have created alerts A, B, C, and so on. When they appear separately, it is not a big deal, but if they appear together, I would like to receive a notification so I can act accordingly.
Therefore, I wonder if it's feasible to design an alert D that only appears when all the others do.
I have only found a solution using dependent triggers, but it's not suitable in such cases.
If the items are on the same host, just add a higher-severity trigger that fires when all items breach their thresholds.
If the items are on different hosts, you can add a "parent" host with the metrics you need. Example: in an 8-node cluster, each cluster node will have a Warning severity problem when it's offline, and the parent host will have a High severity problem when more than 4 nodes are offline.
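The two cases above boil down to simple aggregation logic. The sketch below is plain Python, not Zabbix trigger syntax; the function names and the sample data are hypothetical and only illustrate when the combined problem would fire.

```python
def all_firing(alerts):
    """Alert D: fires only when every underlying alert is firing."""
    return len(alerts) > 0 and all(alerts.values())


def too_many_offline(node_online, max_offline=4):
    """Parent-host problem: fires when more than max_offline nodes are offline."""
    offline = sum(1 for online in node_online.values() if not online)
    return offline > max_offline


# Hypothetical sample state: alert C is not firing, nodes 1 to 5 are offline.
alerts = {"A": True, "B": True, "C": False}
nodes = {f"node{i}": i > 5 for i in range(1, 9)}

print(all_firing(alerts))
print(too_many_offline(nodes))
```

In Zabbix itself the same conditions would be written as trigger expressions on the host or on the parent host.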
What you are looking for is Trigger Dependency. You can update the current trigger and add dependency with other triggers.
In this case, your trigger is suppressed if the trigger it depends on has already fired.
Cannot find a clean way to set Stackdriver alert notifications on errors in cloud functions
I am using a cloud function to process data to cloud data store. There are 2 types of errors that I want to be alerted on:
Technical exceptions which might cause function to 'crash'
Custom errors that we are logging from the cloud function
I have done the below,
Created a log metric searching for specific errors (although this will not work for 'crash' as the error message can be different each time)
Created an alert for this metric in Stackdriver monitoring with parameters as in below code section
This is done as per the answer to the question,
how to create alert per error in stackdriver
For the first trigger of the condition I receive an email. However, on subsequent triggers, let's say on the next day, I don't. Also, the incident stays in the 'opened' state.
Resource type: cloud function
Metric: from point 2 above
Aggregation: Aligner: count, Reducer: None, Alignment period: 1m
Configuration: Condition triggers if: Any time series violates, Condition: is above, Threshold: 0.001, For: 1 min
So I have 3 questions,
Is this the right way to satisfy my requirement of creating alerts?
How can I still receive alert notifications for subsequent errors?
How to set the incident to 'resolved' either automatically/ manually?
I was having a similar problem and managed to at least get a mail every time. The "trick" seems to be to use sum instead of count, in combination with "for most recent value" - see the screenshot below.
This causes Stackdriver to send a mail every time a matching log entry is found and to close the issue a minute later.
Normally, alerts resolve themselves once the alerting policy stops firing. The problem you're having with your alerts not resolving is because your metric only writes non-zero points - if there are no errors, it doesn't write zero. That means that the policy never gets an unambiguous signal that everything is fine, so the alerts just sit there (they'll automatically close after 7 days, but I imagine that's not all that useful for you).
This is a common problem and it's a tricky one to solve. One possibility is to write your policy as a ratio of errors to something non-zero, like request count. As long as the request count is non-zero, the ratio will compute zero if there are no errors, and so an alert on the ratio will automatically resolve. You need to be a bit careful about rounding errors, though - if your request count is high enough, you might potentially miss a single error because the ratio could round to zero.
Aaron Sher, Stackdriver engineer
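The ratio idea from the answer above can be illustrated with a little arithmetic. This is only a sketch in plain Python, not Stackdriver configuration; the function names, the sample counts, and the four-decimal rounding are assumptions chosen to show why a ratio yields an explicit zero and how a single error can vanish at high request volumes.

```python
def error_ratio(error_count, request_count):
    """Errors per request for one window; None if there was no traffic."""
    if request_count == 0:
        return None  # no denominator, so no data point at all
    return error_count / request_count


def should_alert(ratio, threshold=0.0):
    """Fire only when a data point exists and it exceeds the threshold."""
    return ratio is not None and ratio > threshold


# Hypothetical (errors, requests) pairs for three alignment windows.
windows = [(3, 120), (0, 95), (1, 2_000_000)]
for errors, requests in windows:
    ratio = error_ratio(errors, requests)
    # Rounding to four decimals stands in for limited precision downstream.
    print(errors, requests, should_alert(ratio), round(ratio, 4))
```

The second window produces an explicit 0.0, which is the unambiguous "everything is fine" signal the policy needs in order to resolve. The third window shows the rounding risk: one error in two million requests rounds to zero.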
We got around this issue by having the insertId as a label of the log-based metric we created for every log record we get from the pods running our services.
In the alerting policy, this label helped in two things:
We grouped by it (named as record_id) which served in making each incident unique, so it got reported without waiting for other incidents to get resolved and at the same time it got resolved instantly.
We used it in the documentation of the notification to include a direct link to the issue (log record) itself which was a nice and essential feature to have. https://console.cloud.google.com/logs/viewer?project=MY_PROJECT&advancedFilter=insertId%3D%22${metric.label.record_id}%22
As @Aaron Sher mentioned in his answer, it is a tricky problem. We might have done something that is not recommended or not efficient, but it works fine, and of course we are open to suggestions for improvement.
I have a question about fiware-skuld.
Does Skuld work within a federation?
Must it be used globally, or in each FIWARE Lab region?
It is not a good idea to run Skuld individually in each region. There are some serious problems:
The users are global. The change of user type (from Trial Users to Basic Users) can be invoked only once. The same is true for the notifications: users do not want a notification for each region.
There is a synchronisation problem if each region deletes its resources whenever it wants. Users must be notified only once, and with a defined amount of advance notice.
At this moment the scripts are invoked for only one region, but to support a federation it is sufficient to modify only the scripts that delete resources so that they iterate over each region.
When using distributed and scalable architecture, eventual consistency is often a requirement.
Graphically, how to deal with this eventual consistency?
Users are used to click save, and see the result instantaneously... with eventual consistency it's not possible.
How to deal with the GUI for such scenarios?
Please note the question applies both for desktop applications and web applications.
PS: I'm working with the Microsoft platform, but I imagine the question applies to any technology...
A Task Based UI fits this model great. You create and execute tasks from the UI. You can also have something like a task status monitor to show the user when a task has executed.
Another option is to use some kind of polling from the client. You send the command, and poll from the client until the command has completed and the new data is available. In some cases there will be a delay between the user pressing save and seeing the new record, but in most cases it should be almost synchronous.
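A minimal sketch of that client-side polling loop, assuming a read model that returns nothing until the write has propagated. `poll_until` and `FakeReadModel` are hypothetical names used only for illustration; a real client would call its own query endpoint instead.

```python
import time


def poll_until(check, timeout=5.0, interval=0.1):
    """Call check() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("command result not visible yet")
        time.sleep(interval)


class FakeReadModel:
    """Hypothetical stand-in for an eventually consistent read side."""

    def __init__(self, visible_after_calls):
        self._remaining = visible_after_calls

    def get_record(self):
        if self._remaining > 0:
            self._remaining -= 1
            return None  # the write has not propagated yet
        return {"id": 42, "status": "saved"}


read_model = FakeReadModel(visible_after_calls=3)
record = poll_until(read_model.get_record, timeout=2.0, interval=0.01)
print(record)
```

The timeout matters: if the read side never catches up, the UI should fall back to an error or "still processing" message rather than spin forever.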
Another (good?) option is to assume/design commands that don't fail. This is not trivial but you can have a cache on the client and add the data from the command to that cache and display it to the user even before the command has been executed. If the command fails for some unexpected situation, well then just design a good "we are sorry" message for misleading the user for a few seconds.
You can also combine the methods above.
Usually eventual consistency is more of a business/domain problem, and you should have your domain experts handle it.
I think that other answers mix together CQRS in general and eventual consistency in particular. Task-based UI is very suitable for CQRS but it does not resolve the issue with eventually consistent read model.
First, I would like to challenge your statement:
Users are used to click save, and see the result instantaneously... with eventual consistency it's not possible.
What do you mean by this? Why is it not possible to see the result immediately? I think the issue here is your definition of result.
The result of any action is that the action has been performed. There are numerous ways to show this! It depends on what kind of action you want to complete. Examples:
Send an email: if user has entered a correct email address, it is almost guaranteed that the action will complete successfully. To prevent unexpected failures one might use durable queues since this kind of actions do not need to be done synchronously. So you just say "email sent". Typically you see this kind of response when you ask to reset your password.
Update some information in a user profile: after you have validated the new data on the client, the command will most probably succeed too, since the only thing that could go wrong is a database error (if you use a database). Again, even this can be mitigated by using durable queues. In this case you just show the updated field in the same form. A good practice for SPAs is to have a comprehensive data store on the client side, like Redux does. In this case you can safely update the server by sending a command and also update the client-side store, which will result in the UI showing the latest data. Disclaimer: some answers refer to this technique as "tricking the user", but I disagree with this definition.
If you have commands that are prone to error, you can use techniques that are already described in other answers, like WebSockets or Server-Sent Events, to communicate errors back. This requires quite a lot of additional work. You can also send a command and wait for the reply, or execute commands synchronously. Some would say "this is not CQRS", but this would be just another dogma to be challenged. Ensuring the command has completed execution, in combination with the previous point (client-side data store), will be a good solution.
I am not sure if there is any 100% bulletproof technique that allows you to always show non-stale data from the read model. I think it goes against the principles of CQRS. Even with real-time events you will only get events that indicate that your write model has been updated. Still, your projections could have failed, and reacting to this is a whole other story.
However, I would not concentrate that much on this issue. The fact is that well-tested projections and almost-guaranteed commands will work very well. For error handling in 90% of situations it is enough to have some manual or half-manual process to recover from those errors. For the last 10% you can combine generic "error" messages pushed from the server saying "sorry, your action XXX has failed to execute" and the top priority actions could have some creative process behind them but in reality those situations would be very very rare.
There are 2 ways:
Trick the user (show that things have happened when they really haven't happened yet).
Show that the system is processing the request, and use polling in the background (not good) or just a timer set to the value of your SLA.
I prefer the 1st option.
As someone has already mentioned, task based UI's fit well for this, and what I would do is employ a technique that 'buys you time' for the command to propagate.
For example, imagine we are on a list screen, where the user can perform various actions, one of which being to add a new item to the list. After choosing to add an item you could display a "What would you like to do next?" which could have 'Add another item', 'Do this task', 'Do some other task', 'Go back to list'.
By the time they have clicked on an option, the data would have hopefully been refreshed.
Also, if you're using a task based UI, you can analyse the patterns of task execution and use these "what would you like to do next" screens to streamline the UI. Similar to amazon's "other people also bought these items".
As previously stated, it is fine to tell the user that the request (command) has been acknowledged (successfully issued). In case of some failure, the system should communicate this to the requester, by means of:
email;
SMS;
custom inbox (e.g. like the SO inbox);
whatever.
E.g., mail client / service:
I am sending a mail to a wrong address;
the mail service says: "email sent successfully :)";
after a few minutes, I receive a mail from the service: "email could not be delivered".
I believe a great way to inform the user about a recent failure is to present him an error panel while he's navigating through the application. A user gesture might be required in order to dismiss that alert etc.
I wouldn't go with tricking the user or blocking him from committing some other actions. I would rather go for streaming data toward UI after they are being acknowledged by a read side. Let's consider these two cases:
User saves data and expects a result. A connection is established toward the server. After the data have been acknowledged by the read side, they are streamed toward the UI and the UI is updated.
User saves data and refreshes web page. Upon reload, data are being fetched from data store and connection for streaming is established. If read side didn't update the data store in the meantime, there's still an opened stream and UI should be updated after data reaches the read side.
Why streaming from read side and not directly from write side? Simply, that would be a confirmation that read side has been reached.
From technical aspect, Server-Sent Events could be used.
Disadvantage:
Results will still not be reflected immediately by a read side. But at least, in most cases, user will be able to continue with his work without being blocked by a UI.
There are several ways to handle eventual consistency. All of them are really to occupy the time from the User's action until the backend refresh.
User Reads: A given user can only read from the same database node that they write to. Other users read from the replicated nodes. PROS: UI is quick enough, and application stays in sync. CONS: Your service architecture has to track and route Users to specific database nodes.
Disable the UI until the action has completed, and refresh it. Java Server Faces has a classic example of this. One could create a modal with a loading spinner to cover the UI until the refresh was completed. PROS: UI stays in sync with application state. CONS: Most every action creates a blocked UI. Users get very frustrated by the restricted UI, and will complain of application slowness.
Confirmation: Immediately thank the user for their submission. Then let them know later (email, SMS, in-app notification) whether or not the action was completed. PROS: It's fast up front. CONS: UI lags behind system until refresh. Even with a notice, the User may get confused that they don't see the updates. It also requires integration of various communication channels. Users won't see their changes right away. If the action fails, they may not know until it's too late.
Fake it: Optimistically assume that the action will complete. Show the User the resulting UI (upvote, comment, credit card confirmation, etc) and allow them to continue as if it succeeded. If there were failures, immediately show them as contextual errors: alerts next to the undone upvotes, in-app alert on the post with the failed comment, email for the declined credit card. PROS: UI feels much faster. CONS: UI is temporarily out of sync with application state, and you must resolve that. One case: you might fake creation of content with temp IDs. But after content is created, the temp IDs will be wrong until the refresh. Second case: you might need to store all state changes on the UI after the action until the refresh. Then you need some Resolver to apply all the local state changes since the action was issued. This resolution is non-trivial.
Web Sockets: Subscribe the UI to an event stream so that when the action is completed on the backend, it is pushed to the front end. Is it one-way or two-way streaming? PROS: UI feels fast, and it's in sync with the application state. CONS: Consistent browser support, need a backend source of streaming events, and socket server scalability.
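The temp-ID bookkeeping mentioned under the optimistic option can be sketched as follows. This is an illustrative, assumed design rather than any particular framework's API: `OptimisticStore`, its methods, and the ID formats are all hypothetical.

```python
import itertools


class OptimisticStore:
    """Client-side cache that shows changes before the server confirms them."""

    def __init__(self):
        self.items = {}        # id -> text, as currently shown in the UI
        self.pending = set()   # temporary ids still awaiting confirmation
        self._temp_ids = itertools.count(1)

    def add(self, text):
        """Show the new item immediately under a temporary id."""
        temp_id = f"tmp-{next(self._temp_ids)}"
        self.items[temp_id] = text
        self.pending.add(temp_id)
        return temp_id

    def confirm(self, temp_id, real_id):
        """Server accepted the command: swap the temporary id for the real one."""
        self.items[real_id] = self.items.pop(temp_id)
        self.pending.discard(temp_id)

    def reject(self, temp_id):
        """Server rejected the command: undo the change and return an error text."""
        text = self.items.pop(temp_id)
        self.pending.discard(temp_id)
        return f"Sorry, your comment could not be saved: {text!r}"


store = OptimisticStore()
first = store.add("Nice post")
second = store.add("Me too")
store.confirm(first, "c-1001")    # hypothetical server-assigned id
message = store.reject(second)    # hypothetical server-side failure
print(store.items)
print(message)
```

Anything that referenced a temporary id before confirmation (links, later edits) would also need remapping, which is the non-trivial resolution step the answer warns about.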
I am using some XBee (S2) modules with the ZB stack to evaluate mesh networking, so I need to create a multi-hop environment. The problem is that the firmware handles association by itself, and there is no way to get deeper into the stack than the API provides. To force the path of the data without disturbing the routing mechanism that I am trying to measure, I had to put the modules out of each other's reach. Getting a node to associate only with the next hop isn't that easy. I used the lowest output power level, but the distance needed for the test setup is still too large, and the RF characteristics of the environment change unpredictably.
So my question is: does anyone have experience with this issue?
Regards, Toby
I don't think it's possible through software and coordinator/routers. You could change the Node Join Time (ATNJ) to force a new router to join through a particular router (disable Node Join on all nodes except one), but that would only affect joining. Once joined to the network, the router will discover that other nodes are within range.
You could possibly do it with sleepy end devices. You can use the ATNJ trick to force an end device to join through a single router, and it will always send its messages to that router. But you won't get that many hops -- end device sends to its parent router, which sends to the target's parent router, which sends to the target end device.
You'll likely need to physically limit the range of the radios to force hopping, as demonstrated in the video you linked of Digi's K-Node test equipment with a network of over 1000 radios. They're putting the radios in RF-shielded boxes and using wired antenna connections with software-controlled attenuators to connect the modules to each other.
If you have XBee modules with the U.fl or RPSMA connector, and don't connect an antenna, it should significantly reduce the range of the module. Otherwise, with a wire whip or integrated PCB antenna, you need to put each radio in some sort of box that attenuates the signal. Perhaps someone else can offer advice on materials that will reduce the signal's range without completely blocking it.
ZigBee nodes try to automatically form an ad hoc network. That is why they join the network through the strongest connection (best network coverage) available at that moment. These modules are designed in such a way that you do not have to care much about establishing reliable communication. They will solve networking problems most of the time.
What you want to do is somehow force a different situation. You want to create a specific topology in order to get some multi-hopping. That will not be the normal behavior of the nodes, but you can still get what you want with some of the AT commands.
The mentioned command "NJ" should work for you. This command locks joins after a certain time (in seconds). Let us think of a simple ZigBee network with three nodes: one Coordinator, one Router and one End-Device. Switch on the Coordinator with "NJ" set to, let us say, two minutes. Then quickly switch on the Router, so it can associate with the Coordinator within these two minutes. After these two minutes, the Coordinator will be locked and will not accept more joins. At that moment you can start the End-Device, which will have to associate with the Router necessarily. This way, you will see that messages between End-Device and Coordinator go through the Router, as you wanted.
You may get a bigger network applying this idea several times, without needing to play with the module's antennas. You can control the AT Parameters remotely (i.e. from a Computer connected to the Coordinator), so you can use some code to help you initialize the network.
On a wiki-style website, what can I do to prevent or mitigate write-write conflicts while still allowing the site to run quickly and keeping the site easy to use?
The problem I foresee is this:
User A begins editing a file
User B begins editing the file
User A finishes editing the file
User B finishes editing the file, accidentally overwriting all of User A's edits
Here were some approaches I came up with:
Have some sort of check-out / check-in / locking system (although I don't know how to prevent people from keeping a file checked out "too long", and I don't want users to be frustrated by not being allowed to make an edit)
Have some sort of diff system that shows any other changes made when a user commits their changes and allows some sort of merge (but I'm worried this will be hard to create and would make the site "too hard" to use)
Notify users of concurrent edits while they are making their changes (some sort of AJAX?)
Any other ways to go at this? Any examples of sites that implement this well?
Remember the version number (or ID) of the last change. Then read the entry before writing it and compare if this version is still the same.
In case of a conflict inform the user who was trying to write the entry which was changed in the meantime. Support him with a diff.
Most wikis do it this way. MediaWiki, Usemod, etc.
Three-way merging: The first thing to point out is that most concurrent edits, particularly on longer documents, are to different sections of the text. As a result, by noting which revision Users A and B acquired, we can do a three-way merge, as detailed by Bill Ritcher of Guiffy Software. A three-way merge can identify where the edits have been made from the original, and unless they clash it can silently merge both edits into a new article. Ideally, at this point carry out the merge and show User B the new document so that she can choose to further revise it.
Collision resolution:
This leaves you with the scenario when both editors have edited the same section. In this case, merge everything else and offer the text of the three versions to User B - that is, include the original - with either User A's version in the textbox or User B's. That choice depends on whether you think the default should be to accept the latest (the user just clicks Save to retain their version) or force the editor to edit twice to get their changes in (they have to re-apply their changes to editor A's version of the section).
Using three-way merging like this avoids lock-outs, which are very difficult to handle well on the web (how long do you let them have the lock?), and the aggravating 'you might want to look again' scenario, which only works well for forum-style responses. It also retains the post-respond style of the web.
If you want to Ajax it up a bit, dynamically 3-way merge User A's version into User B's version while they are editing it, and notify them. Now that would be impressive.
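The merge rule described in this answer can be sketched at section granularity. This is a simplified illustration, not the algorithm of any particular wiki or diff tool: it assumes an article is stored as a mapping from section name to text, and `three_way_merge` is a hypothetical helper. On a clash it keeps User A's already-saved text and reports the section so User B can resolve it.

```python
def three_way_merge(base, saved, incoming):
    """Merge two edits of `base`, section by section.

    base     -- the revision both users started from
    saved    -- User A's version, already stored
    incoming -- User B's version, being saved now
    Returns (merged, conflicts), where conflicts lists clashing sections.
    """
    merged, conflicts = {}, []
    # Preserve section order: base first, then any newly added sections.
    keys = list(dict.fromkeys([*base, *saved, *incoming]))
    for key in keys:
        b, a, i = base.get(key), saved.get(key), incoming.get(key)
        if a == i:        # both agree, or neither changed
            value = a
        elif a == b:      # only User B changed this section
            value = i
        elif i == b:      # only User A changed this section
            value = a
        else:             # both changed it differently
            conflicts.append(key)
            value = a     # keep the saved text; User B must resolve
        if value is not None:   # None means the section was deleted
            merged[key] = value
    return merged, conflicts


base = {"intro": "Hello", "usage": "Run it", "faq": "None yet"}
user_a = {"intro": "Hello, world", "usage": "Run it", "faq": "Q1"}
user_b = {"intro": "Hello", "usage": "Run it with -v", "faq": "Q2"}

merged, conflicts = three_way_merge(base, user_a, user_b)
print(merged)
print(conflicts)
```

Here the edits to "intro" and "usage" merge silently, and only "faq" is reported as a conflict. A real wiki would typically apply the same rule to lines or paragraphs found by a diff, rather than to named sections.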
In Mediawiki, the server accepts the first change, and then when the second edit is saved a conflicts page comes up, and then the second person merges the two changes together. See Wikipedia: Help:Edit Conflicts
Using a locking mechanism will probably be the easiest to implement. Each article could have a lock field associated with it and a lock time. If the lock time exceeded some set value, you'd consider the lock to be invalid and remove it when checking out the article for edit. You could also keep track of open locks and remove them on session close. You'd also need to implement some concurrency control in the database (autogenerated timestamps, perhaps) so that you could make sure that you are checking in an update to the version that you checked out, just in case two people were able to edit the article at the same time. Only the one with the correct version would be able to successfully check in an edit.
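A minimal in-memory sketch of such an expiring lock, assuming a single server process. `ArticleLocks`, its method names, and the 60-second limit in the example are hypothetical; a real site would keep the lock owner and lock time in the database alongside the article.

```python
import time


class ArticleLocks:
    """Edit locks that are treated as invalid after max_age seconds."""

    def __init__(self, max_age=900.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._locks = {}  # article_id -> (user, acquired_at)

    def check_out(self, article_id, user):
        """Return True if `user` now holds the lock, False if someone else does."""
        now = self.clock()
        holder = self._locks.get(article_id)
        if holder is not None:
            owner, acquired_at = holder
            if owner != user and now - acquired_at <= self.max_age:
                return False  # still validly locked by another user
        # New lock, renewal by the same user, or a stale lock being replaced.
        self._locks[article_id] = (user, now)
        return True

    def check_in(self, article_id, user):
        """Release the lock if `user` holds it."""
        holder = self._locks.get(article_id)
        if holder is not None and holder[0] == user:
            del self._locks[article_id]


# A fake clock makes the expiry visible without actually waiting.
fake_now = [0.0]
locks = ArticleLocks(max_age=60.0, clock=lambda: fake_now[0])
first = locks.check_out(14, "alice")
second = locks.check_out(14, "bob")
fake_now[0] = 61.0
third = locks.check_out(14, "bob")
print(first, second, third)
```

Because a stale lock can be taken over while the original editor is still typing, the version check described in the answer is still needed at check-in time.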
You might also be able to find a difference engine that you could just use to construct differences, though displaying them in a wiki editor may be problematic -- actually displaying the differences is probably harder than constructing the diff. You'd rely on the versioning system to detect when you needed to reject an edit and perform a diff.
In Gmail, if we are writing a reply to a mail and someone else sends a reply while we are still typing it, a popup appears indicating that there is a new update and the update itself appears as another post without a page reload. This approach would suit your needs and if you can use Ajax to show the exact post with a link to diff of what was just updated while User B is still busy typing his entry that would be great.
As Ravi (and others) have said, you could use an AJAX approach and inform the user when another change is in progress. When an edit is submitted, just indicate the textual differences and let the second user work out how to merge the two versions.
However, I'd like to add on with something new you could try in addition to that: Open a chat dialog between the editors while they're doing their edits. You could use something like embedded Gabbly for that, for instance.
The best conflict resolution is direct dialog, I say.
Your problem (lost update) is solved best using Optimistic Concurrency Control.
One implementation is to add a version column to each editable entity in the system. On user edit, you load the row and display the HTML form to the user. A hidden field holds the version, let's say 3. The update query needs to look something like:
update articles set ..., version=4 where id=14 and version=3;
If the number of rows affected is 0, then someone has already updated article 14. All you need to do then is decide how to deal with the situation. Some common solutions:
last commit wins
first commit wins
merge conflicting updates
let the user decide
Instead of an incrementing version int/long you can use a timestamp but it's not suggested because:
retrieving the current time from the JVM isn't necessarily safe in a clustered environment, where nodes may not be time synchronized.
(quote from Java Persistence with Hibernate)
Some more info at the hibernate documentation.
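The versioned update from this answer can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch: the in-memory database, the `body` column, and the `save_article` helper are assumptions added for illustration, while the `articles` table, id 14, and version 3 follow the example query.

```python
import sqlite3

# Throwaway in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)"
)
conn.execute("INSERT INTO articles VALUES (14, 'original text', 3)")


def save_article(conn, article_id, new_body, expected_version):
    """Update only if nobody else changed the row since it was loaded."""
    cursor = conn.execute(
        "UPDATE articles SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_body, article_id, expected_version),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows affected means a concurrent update won


# Both users loaded version 3; only the first save goes through.
ok_a = save_article(conn, 14, "edit by user A", expected_version=3)
ok_b = save_article(conn, 14, "edit by user B", expected_version=3)
print(ok_a, ok_b)
```

When the second save is refused, the application picks one of the strategies listed above, for example showing User B a diff against the stored text.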
At my office, we have a policy that all data tables contain 4 fields:
CreatedBy
CreatedDate
LastUpdateBy
LastUpdateDate
That way there is a nice audit trail on who has done what to the records, at least most recently.
But most importantly, it becomes easy enough to compare the LastUpdateDate of the current or edited record on the screen (which requires you to store it on the page, in a cookie, or wherever) with the value in the database. If the values don't match, you can decide what to do from there.