Logback reliability

Log4j is not reliable, according to the following FAQ entry:
http://logging.apache.org/log4j/1.2/faq.html#a1.2
"No. log4j is not reliable. It is a best-effort fail-stop logging system."
Is Logback more reliable? Is it possible that when 1000 log messages (for example) are written using Logback in a very short span of time, it might silently miss a few of them?
Thanks,
Sunil

I think that Logback is also a best-effort fail-stop logging system. Run this snippet:
Logger logger = LoggerFactory.getLogger(LoopLog.class); // SLF4J logger backed by Logback
for (int i = 0; i < 8; i++) {
    System.out.println("log " + i);
    logger.info("log {}", i);
    Thread.sleep(2000); // declare or handle InterruptedException
}
with a FileAppender:
<appender name="file" class="ch.qos.logback.core.FileAppender">
<file>/mnt/logtest/testlog.log</file>
<append>false</append>
<encoder>
<pattern>%d [%thread] %level %mdc %logger{35} - %msg%n</pattern>
</encoder>
</appender>
on a disk which has no free space. It ran without any errors. A few seconds later I deleted some files from the disk.
The contents of the testlog.log file were:
2011-10-07 08:19:01,687 [main] INFO logbacktest.LoopLog - log 5
2011-10-07 08:19:03,688 [main] INFO logbacktest.LoopLog - log 6
2011-10-07 08:19:05,688 [main] INFO logbacktest.LoopLog - log 7
There are no log 0 - log 4 lines in the file. I wouldn't expect other appenders to be more reliable.
Under normal operating conditions (e.g. the system has enough disk space) I've never seen Logback lose a message. In that sense I think it's reliable. But if you want to do audit logging, I think you should use something else, not a best-effort fail-stop logging system. (If an attacker found a way to disable logging by filling up the disk, he could do everything in the user interface without leaving any audit log - the only sign would be that the disk was full.)

The short answer is no, Logback is not reliable. It is in fact quite unreliable if you benchmark it against other logging frameworks such as Log4j, Log4j 2 and JUL.
Logback has the highest performance of these, but it drops some of its log events in order to achieve it. This is due to the behavior of Logback's AsyncAppender, which drops events below the WARN level if the queue becomes 80% full.
There's often a trade-off between fast and reliable logging. Logback, in particular, maximizes performance by dropping a larger number of events, especially when an asynchronous appender is used. Log4j 1.2.17 and 2.3 tend to be more conservative but can't provide nearly the same performance gains.
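That said, the dropping behaviour of AsyncAppender is configurable. A minimal sketch, assuming you wrap the file appender from the earlier snippet (the appender name and queue size here are illustrative, not a recommendation):
<appender name="async" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>512</queueSize>
  <!-- 0 disables the default "discard TRACE/DEBUG/INFO events when the queue is 80% full" behaviour -->
  <discardingThreshold>0</discardingThreshold>
  <appender-ref ref="file" />
</appender>
With discardingThreshold set to 0 no events are dropped; instead the logging call blocks once the queue is completely full, so you trade throughput for completeness.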

Related

rabbitmq showing wrong disk free limit in management console

As the title says, I have a problem: RabbitMQ shows (and thinks) that there is more space available than I gave it.
I'm running 2 instances of RabbitMQ 3.8.8 with Erlang 23.0 in 2 RHEL pods. A dynamically provisioned PersistentVolume of 2 GB, backed by NFS, is bound to these pods.
That means every pod should have 1 GB of space for itself.
In the rabbitmq.conf I have the following:
vm_memory_high_watermark.relative = 0.9
total_memory_available_override_value = 1000MB
disk_free_limit.absolute = 1GB
management.load_definitions = /etc/rabbitmq/definitions.json
Also, when I start RabbitMQ, I see in the log that the configuration is read correctly:
2020-10-13 08:26:51.726 [info] <0.427.0> Memory high watermark set to 858 MiB (900000000 bytes)
2020-10-13 08:26:51.811 [info] <0.439.0> Enabling free disk space monitoring
2020-10-13 08:26:51.811 [info] <0.439.0> Disk free limit set to 1000MB
The problem is that RabbitMQ somehow thinks the whole free space of the NFS share - 54 GB - is available (as on the screenshot above). So I ran into a situation where over 200K messages were stuck in one of the queues and filled up the 2 GB of PersistentVolume I gave it, but it didn't stop accepting messages, because it thought there was more space available. Of course, the whole RabbitMQ pod crashed, because it couldn't write more messages to the NFS.
Can you please guide me on how to set this up correctly?
Or do you know why RabbitMQ doesn't respect the disk_free_limit.absolute value?
Many thanks
rabbitmq-diagnostics environment | grep disk_free_limit
will display the actual effective configuration value.
On Linux, RabbitMQ will use either a configured absolute value, or compute how much disk space its data directory's partition has by running
df -kP /path/to/directory
which is not aware of Kubernetes quotas.
I don't have an NFS partition on Kubernetes to try but a basic test with the following rabbitmq.conf file
disk_free_limit.absolute = 3GB
does not reproduce; the configured value is used as expected. See 1.
Regarding your question "why rabbitMQ doesnt respect the disk_free_limit.absolute value" - I think it does (even if it reads the free disk space of the k8s pod wrong).
The value is shown in the image you attached as '954 MiB low watermark' - that means that when you have only 1 GB of free disk space, the broker will stop publishers from publishing and will only allow consumers to consume until there's more space available on disk.
So as long as the machine has more than 1 GB available, it continues to accept messages.
Perhaps since it wrongly reads that it has 54 GB of free space it crashes, but the disk_free_limit.absolute value itself seems to be read correctly.

EsRejectedExecutionException in elasticsearch for parallel search

I am querying Elasticsearch with multiple parallel requests, using a single transport client instance in my application.
I got the exception below during parallel execution. How do I overcome this issue?
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#5f804c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Elasticsearch has a thread pool and a queue for search on each node.
A thread pool has N workers ready to handle requests. When a request comes in and a worker is free, it is handled by that worker. By default the number of workers is derived from the number of CPU cores on that node.
When all workers are busy and more search requests arrive, the requests go into the queue. The size of the queue is also limited: if the default size is, say, 100 and more parallel requests than that arrive, the extra requests are rejected, as you can see in the error log.
Solutions:
1. The immediate solution would be to increase the size of the search queue. We could also increase the size of the thread pool, but that might badly affect the performance of individual queries, so increasing the queue is usually the better first step. Remember, though, that this queue is held in memory, and increasing the queue size too much can result in Out Of Memory issues (see the example setting after this list).
2. Increase the number of nodes and replicas - remember that each node has its own search thread pool/queue. Also, a search can be served by either a primary shard or a replica.
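For reference, on the 1.x release line this stack trace comes from, the search queue size is a thread pool setting in elasticsearch.yml (later versions renamed the prefix to thread_pool.search.queue_size); the value below is only an illustrative number, not a recommendation:
threadpool.search.queue_size: 2000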
Maybe it sounds strange, but you need to lower the parallel searches count. With that exception, Elasticsearch tells you that you are overloading it. There are some limits (at the thread count level) that are set in Elasticsearch and, most of the time, the defaults for these limits are the best option. So, if you are testing your cluster to see how much load it can hold, this would be an indicator that some limits have been reached.
Alternatively, if you really want to change the defaults, you can try increasing the queue size for searches to accommodate the concurrency demands, but keep in mind that the larger the queue size, the more pressure you put on your cluster, which in the end will cause instability.
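If lowering the concurrency is the route you take, one client-side option is to bound the number of in-flight searches with a semaphore, so requests wait in your application instead of overflowing the server-side queue. A minimal sketch against the 1.x transport client API; the limit of 50, the class name and the method are illustrative assumptions, not anything Elasticsearch provides:
import java.util.concurrent.Semaphore;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilder;

public class BoundedSearcher {
    // At most 50 searches in flight from this client; tune to your cluster's
    // search thread pool size and queue capacity.
    private final Semaphore inFlight = new Semaphore(50);
    private final Client client;

    public BoundedSearcher(Client client) {
        this.client = client;
    }

    public SearchResponse search(String index, QueryBuilder query) throws InterruptedException {
        inFlight.acquire();   // wait here rather than getting rejected on the server
        try {
            return client.prepareSearch(index)
                         .setQuery(query)
                         .execute()
                         .actionGet();
        } finally {
            inFlight.release();
        }
    }
}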
I saw this same error because I was sending lots of indexing requests to ES in parallel. Since I was writing a data migration, it was easy enough to make them serial, and that resolved the issue.
I don't know what your node configuration was, but your queue size (1000) is already on the higher side. As others have already explained, your search requests are queued in the Elasticsearch thread pool queue. If you are getting rejections even with such a large queue, that is a hint that you need to revisit your query pattern.
Like many other designs, there is no one-size-fits-all solution here. I found this very good post about how this queue works and the different ways to run a performance test to find out what suits your use case best.
HTH!

Synchronous External Abort on ARM

I was building a bare-metal application on an ARM Cortex-A9 Pandaboard, and I got an Instruction Fetch Abort frequently. When I dumped the IFSR register I got 0x1008. I've read the reference manual, and I understand that 0x1008 means Synchronous External Abort. The question is: what does a synchronous external abort mean, and where does it come from? Thanks for your help.
The ARMv7 ARM section "VMSA Memory aborts" covers this as thoroughly as one would expect (given that it's the authoritative definition of the architecture), but to summarise in slightly less than 14 pages;
An abort means the CPU tried to make a memory access, which for whatever reason, couldn't be completed so raises an exception.
An external abort is one from, well, externally to the processor, i.e. something on the bus. In other words, the access didn't fault in the MMU, went out onto the bus, and either some device or the interconnect itself came back and said "hey, I can't deal with this".
A synchronous external abort means you're rather fortunate, in that it's not going to be utterly hideous to debug - in the case of a prefetch abort, it means the IFAR is going to contain a valid VA for the faulting instruction, so you know exactly what caused it. The unpleasant alternative is an asynchronous external abort, which is little more than an interrupt to say "hey, something you did a while ago didn't actually work. No, I don't know what it was either."
So, you're trying to execute instructions from something that you think is memory, but isn't. Without any further details, the actual cause could be anything from a typoed hard-coded address, to dodgy page tables, stale TLB entries, cache coherency, etc. etc.

c3p0 piling Connection objects

While c3p0 removes a connection after maxIdleTime, it adds it to an internal WeakHashMap named formerResources in BasicResourcePool. This map piles up on the heap with JDBC4Connection objects and gets cleared only on GC. Is it possible to opt out of this collection, or is there any clear advantage to it?
The purpose of formerResources is just to permit check-ins of c3p0 resources to be idempotent while, at the same time, check-ins of foreign resources - Connections never before seen by the pool - provoke a warning.
In practice, for most c3p0 users, it is hard to check in a foreign resource at this level: the BasicResourcePool is buried beneath a proxy and then a C3P0ConnectionPool object. But BasicResourcePool itself intends to be generally usable, potentially outside of c3p0. "Checking in" a foreign resource likely implies a significant bug, and resources thus "checked in" will neither be managed by the pool nor genuinely destroyed, so it's important that BasicResourcePool warn when this occurs.
But it's also important that c3p0 not warn if a user has checked in the same resource multiple times: we want it to be permissible for clients to err on the side of over-checking-in (the pool can ignore a second check-in, but a failure to check in at all implies a resource leak). So c3p0 needs to be able to tell the difference between resources it has once managed but has now purged (don't warn, an extra check-in is useful caution) and resources it has never managed (must warn, likely a programming error and resource leak).
Does that make any sense?
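To illustrate the mechanism (a toy sketch of the idea only, not c3p0's actual code): a pool that weakly remembers purged resources can treat a repeated check-in as a harmless no-op while still warning about resources it has never managed. The entries disappear only when the garbage collector clears the WeakHashMap keys, which is the piling-up you are observing.
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

class ToyPool<R> {
    // Resources currently managed by the pool.
    private final Set<R> managed = Collections.newSetFromMap(new WeakHashMap<R, Boolean>());
    // Resources the pool once managed but has since purged; held weakly until GC.
    private final Set<R> formerResources = Collections.newSetFromMap(new WeakHashMap<R, Boolean>());

    void purge(R resource) {
        managed.remove(resource);
        formerResources.add(resource);
    }

    void checkIn(R resource) {
        if (managed.contains(resource)) {
            // normal check-in: return the resource to the pool
        } else if (formerResources.contains(resource)) {
            // idempotent: already purged, a second check-in is silently ignored
        } else {
            System.err.println("WARN: check-in of a resource this pool has never managed");
        }
    }
}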

Does there exist an open-source distributed logging library?

I'm talking about a library that would allow me to log events from different machines and would align these events on a "global" time axis with sufficiently high precision.
Actually, I'm asking because I've written such a thing myself in the course of a cluster computing project, I found it terrifically useful, and I was surprised that I couldn't find any analogues.
Therefore, the point is whether something like this already exists (and I had better contribute to it) or nothing exists (and I had better write an open-source analogue of my solution).
Here are the features that I'd expect from such a library:
Independence from the clock offset between different machines
Timing precision on the order of at least milliseconds, preferably microseconds
Scalability to thousands of concurrent logging processes, with at least several megabytes of aggregated logs per second
Soft real-time operation (i.e. I don't want to collect 200 big logs from 200 machines and then compute clock offsets and merge them - I want to see what happens "live", perhaps with a small lag like 10s)
Facebook's contribution in the matter is called 'Scribe'.
Excerpt:
Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.
...
Scribe is implemented as a thrift service using the non-blocking C++ server. The installation at facebook runs on thousands of machines and reliably delivers tens of billions of messages a day.
The API is Thrift-based, so you have good platform coverage, but if you're looking for simple Java integration you may want to have a look at Digg's log4j appender for Scribe.
You could use log4j/log4net targeting a central syslog daemon. log4j has a builtin SyslogAppender, and in log4net you can do it as shown here. log4cpp docs here.
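For the log4j route, here is a minimal sketch of attaching the built-in SyslogAppender programmatically; the host name "loghost" and the LOCAL0 facility are placeholder assumptions, and in practice you would normally configure this in log4j.properties instead:
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.net.SyslogAppender;

public class SyslogSetup {
    public static void main(String[] args) {
        // Send everything from the root logger to a central syslog daemon.
        SyslogAppender syslog = new SyslogAppender();
        syslog.setSyslogHost("loghost");   // placeholder: your central syslog host
        syslog.setFacility("LOCAL0");
        syslog.setLayout(new PatternLayout("%d [%t] %-5p %c - %m%n"));
        syslog.activateOptions();
        Logger.getRootLogger().addAppender(syslog);

        Logger.getLogger(SyslogSetup.class).info("hello from a cluster node");
    }
}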
There are Windows implementations of Syslog around if you don't have a Unix system to hand for this.
Use Chukwa. It's an open-source, large-scale log monitoring system.