Postfix queue keeps growing - smtp

im having some trouble with our mail server since yesterday.
First, the server was down for couple days, thanks to KVM, VMs were paused because storage was apparently full. So i managed to fix the issue. But since the mail server is back online, CPU usage was always at 100%, i checked logs, and there was "millions", of mails waiting in the postfix queue.
I tried to flush the queue, thanks to the PFDel script, it took some times, but all mails were gone, and we were finally able to receive new emails. I also forced a logrotate, because fail2ban was also using lots of CPU.
Unfortunately, after couple hours, postfix active queue is still growing, and i really dont understand why.
Another script i found is giving me that result right now:
Incoming: 1649
Active: 10760
Deferred: 0
Bounced: 2
Hold: 0
Corrupt: 0
is there a possibility to desactivate ""Undelivered Mail returned to Sender" ?
Any help would be very helpful.
Many thanks

You could firstly temporarily stop sending bounce mails completely or set more strict rules in order to analyze the reasons of the flood. See for example:http://domainhostseotool.com/how-to-configure-postfix-to-stop-sending-undelivered-mail-returned-to-sender-emails-thoroughly.html
Sometimes the spammers find some weakness (or even vulnerability) in your configuration or in SMTP server and using that to send the spam (also if it could reach the addressee via bounce only). Mostly in this case, you will find your IP/domain in some common blacklist services (or it will be blacklisted by large mail providers very fast), so this will participate additionally to the flood (the bounces will be rejected by recipient server, what then let grow you queue still more).
So also check your IP/domain using https://mxtoolbox.com/blacklists.aspx or similar service (sometimes they provide also the reason why it was blocked).
As for fail2ban, you can also analyze logs (find some pattern) to detect the evildoers (initial sender), and write custom RE for fail2ban to ban them for example after 10 attempts in 20 minutes (or add it to ignore list for bounce messages in postfix)... so you'd firstly send X bounces, but hereafter it'd ban the recidive IPs, what could also help to reduce the flood significantly.
An last but not least, check your config (follow best practices for it) and set up at least MX/SPF records, DKIM signing/verification and DMARC policies.

Related

How do multiple developers use the same queue for development?

We use SQS for queueing use-cases in our company. All developers connect to the same queue for local development. If we're producing some messages for testing in local development, it can happen that the message is consumed on other person's locally running consumer, if that person has the app running at the same time.
How do you make sure that messages produced by one person don't end up getting lost by consumption on other person's locally running consumer. Is using different different queues for each person the only solution? Wondering what is standard followed to avoid this in the industry?
This is very open-ended IMO. Would recommend adding some context as to how you're using SQS.
But from what I could understand:
Yes, I would recommend creating queues per "developer"
OR
Although not elegant, you can maybe add an SQS message attribute (this is metadata other than message body) with a developer's username.
And each developer should then only process a message if it's meant for them. Arguably, you could also add a flag in the message itself, but, I am not sure about the constraints on your message format. Message attributes are meant to be used for these situations, where you want to know if you really need to process a message before even parsing the message body.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-message-metadata.html#sqs-message-attributes
But you'll have to increase the maxReceives to a high number (so that message does not move to dead letter queue, if you have configured one). This is not exhaustive, it will just decrease the chances of your messages being deleted by someone else. Because if say, 10 people read the message and did not delete it because username was not their username, and maxReceives is 8, it will still move to DLQ and cause unnecessary confusion.

What does Chrome Network Timings really mean and what does affects each timing length?

I was looking at chrome dev tools #resource network timing to detect requests that must be improved. In the link before there's a definition for each timing but I don't understand what processes are being taken behind the scenes that are affecting the length of the period.
Below are 3 different images and here is my understanding of what's going on, please correct me if I'm wrong.
Stalled: Why there are timings where the request get's stalled for 1.17s while others are taking less?
Request Sent: it's the time that our request took to reach server
TTFB: Time took until the server responds with the first byte of data
Content Download: The time until the whole response reaches the client
Thanks
Network is an area where things will vary greatly. There are a lot of different numbers that go into play with these and they vary between different locations and even the same location with different types of content.
Here is some more detail on the areas you need more understanding with:
Stalled: This depends on what else is going on in the network stack. One thing could not be stalled at all, while other requests could be stalled because 6 connections to the same location are already open. There are more reasons for stalling, but the maximum connection limit is an easy way to explain why it may occur.
The stalled state means, we just can't send the request right now it needs to wait for some reason. Generally, this isn't a big deal. If you see it a lot and you are not on the HTTP2 protocol, then you should look into minimizing the number of resources being pulled from a given location. If you are on HTTP2, then don't worry too much about this since it deals with numerous requests differently.
Look around and see how many requests are going to a single domain. You can use the filter box to trim down the view. If you have a lot of requests going off to the same domain, then that is most likely hitting the connection limit. Domain sharding is one method to handle this with HTTP 1.1, but with HTTP 2 it is an anti-pattern and hurts performance.
If you are not hitting the max connection limit, then the problem is more nuanced and needs a more hands-on debugging approach to figure out what is going on.
Request sent: This is not the time to reach the server, that is the Time To First Byte. All request sent means is the request is sent and it took the network stack X time to carry it out.
Nothing you can do to speed this up, it is more for informational and internal debugging purposes.
Time to First Byte (TTFB): This is the total time for the sent request to get to the destination, then for the destination to process the request, and finally for the response to traverse the networks back to the client.
A high TTFB reveals one of two issues. The first is a bad network connection between the client and server. So data is slow to reach the server and get back. The second cause is, a slow server processing the request. This is either because the hardware is weak or the application running is slow. Or, both of these problems can exist at once.
To address a high TTFB, first cut out as much network as possible. Ideally, host the application locally on a low-resource virtual machine and see if there is still a big TTFB. If there is, then the application needs to be optimized for response speed. If the TTFB is super-low locally, then the networks between your client and the server are the problem. There are various ways to handle this that I won't get into since it is an area of expertise unto itself. Research network optimization, and even try moving hosts and seeing if your server providers network is the issue.
Remember the entire server-stack comes into play here. So if nginx or apache are configured poorly, or your database is taking a long time to respond, or your cache is having trouble, then these can cause delays. They are also difficult to detect locally, since your local server could vary in configuration from the remote stack.
Content Download: This is the total time from the TTFB resolving for the client to get the rest of the content from the server. This should be short unless you are downloading a large file. You should take a look at the size of the file, the conditions of the network, and then judge about how long this should take.

Tibco RV: Message Lifetime + No Copy for fanouts

1, How long does a message live in the listener queue? Until the dispatcher reads the message out of the queue in a "1 publisher 1 consumer" scenario?
Listener listener = new Listener(Queue.Default, transport, subject, new object());
listener.MessageReceived += OnMessageReceived;
Dispatcher dispatcher = new Dispatcher(listener.Queue);
2, Tibco RV is typically used in a large fanout system, with relatively loose requirements on delivery reliability, say, market data published to 20 applications in an enterprise. I've heard Tibco RV implements a "no copy" solution for fanouts - how is that even possible? I'd assume we need to at least go through all registered listeners for that queue and notify each of them, in which process the message is copied 20 times. Please enlighten me.
3, Combine question 1 and 2, it doesn't make sense for a message to live in the listener queue until ALL registered listeners have consumed the message - What happens if 1 of the 20 applications goes offline? It's going to bring down the rv daemon process due to ever increasing messages. Does Tibco RV have a lifetime limit(ttl) for each message? How do I check it and set it to new values?
There isn't much related info on Google, so please help.
Thanks.
Great questions.
Keep in mind that unless you are using RV certified messaging there will be no persistence to disk. Messages sent will remain in the memory of the sending Rendezvous daemon until they are delivered to all consumers.
That said, another thing to understand is that RV is an optimistic protocol as opposed to say, TCP which is a pessimistic protocol. Every packet is sent using TCP must be acknowledged. This round-trip protocol slows things down. Rendezvous on the other hand uses a protocol that sits on top of UDP which is session-less and does not require acks. Therefore, when a Rendezvous daemon sends a message, it is assumed to have been delivered successfully unless a retransmission request is received. So to completely answer your first question the default behavior of a Rendezvous daemon is to keep messages that it has sent in memory for 60 seconds after it has sent the message. This allows it to honor retransmission requests.
Fan out is achieved using broadcast or multicast protocols on top of UDP. The use of broadcast is discouraged and multicast is encouraged. Using multicast groups uses considerably less network resources. At the network interface level only those hosts that have joined the multicast group will receive packets associated with the Rendezvous traffic. Similarly at the network switch level multicast is a lot less resource-intensive.
Bottom line is that the sending Rendezvous daemon only sends the message out once, and the network delivers a copy of the associated packets to either each host on the subnet if broadcast is used, or the hosts that have registred interest if multicast is used.
In pub-sub, typically consumers receive messages that are sent while they are alive and consuming. Thus with pure Rendezvous, if a consumer was to go down, the subscription would be cancelled for that consumer. If we think about your example of market data, this is exactly the behavior we want. IBM trades at thousands of times a second, so if I miss a price quote it's no big deal. I'll get the next one. Furthermore, I don't want stale prices.
That said, sometimes consumers do want messages sent while they were down. This can be achieved using certified messaging and setting up persistent correspondents. For more on this see the Rendezvous Concepts Guide. Lastly, the 60-second behavior that I mentioned in point #1 can be changed using the -reliability parameter when starting the Rendezvous daemon. There are some cases where this may make sense (although the default is best for most common cases). For more details on this take a look at the Rendezvous Admin Guide.

Simple scalable work/message queue with delay

I need to set up a job/message queue with the option to set a delay for the task so that it's not picked up immediately by a free worker, but after a certain time (can vary from task to task). I looked into a couple of linux queue solutions (rabbitmq, gearman, memcacheq), but none of them seem to offer this feature out of the box.
Any ideas on how I could achieve this?
Thanks!
I've used BeanstalkD to great effect, using the delay option on inserting a new job to wait several seconds till the item becomes available to be reserved.
If you are doing longer-term delays (more than say 30 seconds), or the jobs are somewhat important to perform (abeit later), then it also has a binary logging system so that any daemon crash would still have a record of the job. That said, I've put hundreds of thousands of live jobs through Beanstalkd instances and the workers that I wrote were always more problematical than the server.
You could use an AMQP broker (such as RabbitMQ) and I have an "agent" (e.g. a python process built using pyton-amqplib) that sits on an exchange an intercepts specific messages (specific routing_key); once a timer has elapsed, send back the message on the exchange with a different routing_key.
I realize this means "translating/mapping" routing keys but it works. Working with RabbitMQ and python-amqplib is very straightforward.

How can I delay mail delivery through an SMTP relay, possibly sendmail

I have a requirement to delay mail delivery through an SMTP Relay.
i.e.
Mail message is successfully recieved at time T.
Forward Message to destination at time T+4hours.
Is this possible in sendmail or any other SMTP Relay.
Deployment platform is IBM AIX.
You should've been at least a little more specific in your question. I'll just throw in some suggestions anyway.
If you just want to deliver mail every four hours, you have to run sendmail in queue-only mode (QUEUE_MODE="cron"; in sendmail.conf) and set up the queue to be run every four hours (QUEUE_INTERVAL="4h";). I think, this only applies to debian-like systems, but the principle is the same anywhere - you set the queue mode to cron (this is actually controlled by the arguments, with which you start sendmail) and then you process it periodically.
If you want to just delay mail delivery, there is also a number of ways to do it, depending on why you want to do it. One popular solution is greylisting, it does just the following - when a host connects to your MTA (sendmail, f.ex.), it gets bounced with the prompt to try again in some time interval. A properly configured mailer will just do that - it will try sending the mail again and eventually the message will be accepted and delivered (or forwarded). Most of the spam bots, on the other hand, will not try to resend the message upon receiving an error. If you need greylisting on sendmail you can read up here: http://www.greylisting.org/implementations/sendmail.shtml
Hope this helped at least a bit.
EDIT:
Ok, so now I see what you need doing. Here is the possible solution using sendmail (I've been dealing with sendmail in one way or another for years now, so.. :P): You use two of them.
The first one just receives mail and queues it and (and it is important) it does NOT get to process the queue. The second sendmail instance runs a separate queue and its QUEUE_MODE is set to daemon or cron (say, every minute). Now all you need is to write an external script, that would move the mail from the first queue to the second, once the "age" of the message is reached. Since queue items are just files, it is an easy task, done in a few lines of, say, perl (hell, a shell script can do that, too). Moving queue items from queue to queue is as easy as moving files from directory to directory. Please note, that this technique is widely used in mail processing solutions, such as, say spamassassin, so its not some weirdness, conjured by my deseased mind :P
Hope this gives you a hint or two.