Main matrix job sometimes blocks its subjobs - hudson

I have several matrix jobs on a master-slave setup where slaves have only one executor. Sometimes when a matrix job executes, the main job (that triggers all the other ones) occupies a slave, but its subjobs can still run on that slave - which is what I want, since the main job really does not do anything. But sometimes the exact same job would occupy a slave and its subjobs block thinking that there are no available executors.
(1) Does anybody know why the behavior may be different? To me it looks like a bug, but possibly there are subtle reasons that I'm missing.
(2) Can you propose a workaround?
Much obliged.

I have found a workaround: use Matrix Tie Parent Plugin to tie the parent job to master (or some other node) that has multiple executors and is specifically dedicated for the purpose of running matrix parent jobs.

Related

Is it possible to virtualize a single process?

I'm looking for a way to effectively virtualize a single process (and, presumably, any children it creates). Although a Container model sounds appropriate, products like Docker don't quite fit the bill.
The Intel VMX model allows one to create a hypervisor, then launch a VM, which will exit back to the hypervisor under certain (programmable) conditions, such as privileged instruction execution, CR3/CR8 manipulation, exceptions, direct I/O, etc. The features of the VMX model fit very well with my needs, except that I can't require a VM with a separate instance of the entire OS to accomplish the task - I just want my hypervisor to control one child application (think Photoshop/Excel/Firefox; one process and its progeny, if any) that's running under the host OS, and catch VM exits under the specified conditions (for debugging and/or emulation purposes). Outside of the exit conditions, the child process should run unencumbered, and have access to all OS resources to which it would be entitled without the VM, including filesystem, graphical output, keyboard/mouse input, IPC/messaging, etc. For my purposes, I am not interested in isolation or access restriction, which is the typical motivation for using a VM - to the contrary, I want the child process to be fully enmeshed in the host OS environment. While operating entirely in user-space is preferable, I can utilize Ring 0 to facilitate this. (Note that the question is Intel-specific and OS-agnostic, although it's likely to be implemented in a *nix environment first.)
I'm wondering what would happen if I had my hypervisor set up a VMCS that simply mirrored the host's actual configuration, including page tables, IDT, etc., then VMLAUNCH 0(%rip) (in effect, a pseudo-fork?) and execute the child process from there. (That seems far too simplistic to actually work, but the notion does have some appeal). Assuming that's a Bad Idea™, how might I approach this problem?

how to tell NServiceBus is using MaximumConcurrencyLevel?

I'm trying to validate our company's code works when NServiceBus v4.3 is using the MaximumConcurrencyLevel value setup in the config.
The problem is, when I try to process 12k+ of queued entries, I cannot tell any difference in times between the five different max concur levels I change. I set it to 1 and I can process the queue in 8m, then I put it to 2 and I get 9m, seems interesting (I was expecting more, but it's still going in the right direction), but then I put 3, 4, 5 and the timings stay at around 8m. I was expecting a much better throughput.
My question is, how can I verify that NServiceBus is actually indeed using five threads to process entries on the queue?
PS I've tried setting the MaximumConcurrencyLevel="1" and the MaximumMessageThroughputPerSecond along with logging the Thread.CurrentThread.ManagedThreadId thinking\hoping I was ONLY going to see one ThreadID value, but I'm seeing quite a few of different ones, which surprised me. My plan was to see one, then bump the max concur level to 5 and hopefully see five different values.
What am I missing? Thank you in advance.
There can be multiple reasons why you don't see faster processing times when increasing the concurrency setting described on the official documentation page: http://docs.particular.net/nservicebus/operations/tuning
You mentioned you're using the MaximumMessageThroughputPerSecond which will negate any performance gains my parallel message processing if a low value has been configured. Try removing this setting if possible.
Maybe you're accessing a resource in your handlers which isn't supporting/optimized for parallel access.
NServiceBus internally schedules the processing logic on the threadpool. This means that even with a MaximumConcurrencyLevel of 1, you will most likely see a different thread processing each message since there is no thread affinity. But the configuration values work as expected, if your queue contains 5 messages:
it will process these messages one by one if you configured MaximumConcurrencyLevel to 1
it will process all messages in parallel if you configured MaximumConcurrencyLevel to 5.
Depending on your handlers it can of course happen that the first message is already processed at the time the fifth message is read from the queue.

Replication Factor to use for system_auth

When using internal security with Cassandra, what replication factor do you use for system_auth?
The older docs seem to suggest it should be N, where N is the number of nodes, while the newer ones suggest we should set it to a number greater than 1. I can understand why it makes sense for it to be higher - if a partition occurs and one section doesn't have a replica, nobody can log in.
However, does it need to be all nodes? What are the downsides of setting it to all ndoes?
Let me answer this question by posing another:
If (due to some unforseen event) all of your nodes went down, except for one; would you still want to be able to log into (and use) that node?
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes. You can't predict node failure, and in the interests of keeping your application running, it's better safe than sorry.
I don't see any glaring downsides in doing so. The system_auth keyspace isn't very big (mine is 20kb) so it doesn't take up a lot of space. The only possible scenario, would be if one of the nodes is down, and a write operation is made to a column family in system_auth (in which case, I think the write gets rejected...depending on your write consistency). Either way system_auth isn't a very write-heavy keyspace. So you'll be ok as long as you don't plan on performing user maintenance during a node failure.
Setting the replication factor of system_auth to the number of nodes in the cluster should be ok. At the very least, I would say you should make sure it replicates to all of your data centers.
In case you were still wondering about this part of your question:
The older docs seem to suggest it should be N where n is the number of nodes, while the newer ones suggest we should set it to a number greater than 1."
I stumbled across this today in the 2.1 documentation Configuring Authentication:
Increase the replication factor for the system_auth keyspace to N
(number of nodes).
Just making sure that recommendation was clear.
Addendum 20181108
So I originally answered this back when the largest cluster I had ever managed had 10 nodes. Four years later, after spending three of those managing large (100+) node clusters for a major retailer, my opinions on this have changed somewhat. I can say that I no longer agree with this statement of mine from four years ago.
This is why I actually do ensure that my system_auth keyspace replicates to all of my nodes.
A few times on mind-to-large (20-50 nodes) clusters , we have deployed system_auth with RFs as high as 8. It works as long as you're not moving nodes around, and assumes that the default cassandra/cassandra user is no longer in-play.
The drawbacks were seen on clusters which have a tendency to fluctuate in size. Of course, clusters which change in size usually do so because of high throughput requirements across multiple providers, further complicating things. We noticed that occasionally, the application teams would report auth failures on such clusters. We were able to quickly rectify these situation, by running a SELECT COUNT(*) on all system_auth tables, thus forcing a read repair. But the issue would tend to resurface the next time we added/removed several nodes.
Due to issues that can happen with larger clusters which fluctuate in size, we now treat system_auth like we do any other keyspace. That is, we set system_auth's RF to 3 in each DC.
That seems to work really well. It gets you around the problems that come with having too many replicas to manage in a high-throughput, dynamic environment. After all, if RF=3 is good enough for your application's data, it's probably good enough for system_auth.
The reason the recommendation changed was that a quorum query would require responses from over half of your nodes to fullfill. So if you accidentally left Cassandra user active and you have 80 nodes - we need 41 responses.
Whilst it's good practice to avoid using the super user like that - you'd be surprised how often it's still out there.

Performing a distributed CUDA/OpenCL based password cracking

Is there a way to perform a distributed (as in a cluster of a connected computers) CUDA/openCL based dictionary attack?
For example, if I have a one computer with some NVIDIA card that is sharing the load of the dictionary attack with another coupled computer and thus utilizing a second array of GPUs there?
The idea is to ensure a scalability option for future expanding without the need of replacing the whole set of hardware that we are using. (and let's say cloud is not an option)
This is a simple master / slave work delegation problem. The master work server hands out to any connecting slave process a unit of work. Slaves work on one unit and queue one unit. When they complete a unit, they report back to the server. Work units that are exhaustively checked are used to estimate operations per second. Depending on your setup, I would adjust work units to be somewhere in the 15-60 second range. Anything that doesn't get a response by the 10 minute mark is recycled back into the queue.
For queuing, offer the current list of uncracked hashes, the dictionary range to be checked, and the permutation rules to be applied. The master server should be able to adapt queues per machine and per permutation rule set so that all machines are done their work within a minute or so of each other.
Alternately, coding could be made simpler if each unit of work were the same size. Even then, no machine would be idle longer than the amount of time for the slowest machine to complete one unit of work. Size your work units so that the fastest machine doesn't enter a case of resource starvation (shouldn't complete work faster than five seconds, should always have a second unit queued). Using that method, hopefully your fastest machine and slowest machine aren't different by a factor of more than 100x.
It would seem to me that it would be quite easy to write your own service that would do just this.
Super Easy Setup
Let's say you have some GPU enabled program X that takes a hash h as input and a list of dictionary words D, then uses the dictionary words to try and crack the password. With one machine, you simply run X(h,D).
If you have N machines, you split the dictionary into N parts (D_1, D_2, D_3,...,D_N). Then run P(x,D_i) on machine i.
This could easily be done using SSH. The master machine splits the dictionary up, copies it to each of the slave machines using SCP, then connects to the slaves and tells them to run the program.
Slightly Smarter Setup
When one machine cracks the password, they could easily notify the master that they have completed the task. The master then kills the programs running on the other slaves.

Associative cache simulation - Dealing with a Faulty Scheme

While working on simulating a fully associative cache (in MIPS assembly), a couple of questions came to mind based on some information read online;
According to some notes from the University of Maryland
Finding a slot: At most, one slot should match. If
there is more than one slot that
matches, then you have a faulty
fully-associative cache scheme. You
should never have more than one copy
of the cache line in any slot of a
fully-associative cache. It's hard to
maintain multiple copies, and doesn't
make sense. The slots could be used
for other cache lines.
Does that mean that I should check all the time the whole tag list in order to check for a second match? After all if I don't, i will never "realize" about the fault with the cache, yet, checking every single time seems quite inefficient.
In the case I do check, and somehow I manage to find a second match, meaning faulty cache scheme, what shall I do then? Although the best answer would be to fix my implementation, yet Im interested on how to handle it during execution if this situation should arise.
If more than one valid slot matches an address, then that means that when a previous search for the same address was executed, either a valid slot that should have matched the address was not used (perhaps because it was not checked in the first place) or more than one invalid slot was used to store the line that wasn't in the cache at all.
Without a doubt, this should be considered a bug.
But if we've just decided not to fix the bug (maybe we'd rather not commit that much hardware to a better implementation) the most obvious option is to pick one of the slots to invalidate. It will then be available for other cache lines.
As for how to pick which one to invalidate, if one of the duplicate lines is clean, invalidate that one in preference to a dirty cache line. If more than cache line is dirty and they disagree you have an even bigger bug to fix, but at any rate your cache is out of sync and it probably doesn't matter which you pick.
Edit: here's how I might implement hardware to do this:
First off, it doesn't make a whole lot of sense to start with the assumption of duplicates, rather we'll work around that at the appropriate time later. There are a few possibilities of what must happen when caching a new line.
The line is already in the cache, no action is needed
The line is not in the cache but there are invalid slots available: Place the new line into one of the available slots
The line is not in the cache but there are no invalid slots available. Another valid line must be evicted and the new line takes its place.
Picking an eviction candidate has performance consequences. Clean cache lines can be evicted for free, but if chosen poorly, it can cause another cache miss in the near future. Consider if all but one cache line is dirty. If only the clean cache line is evicted, then many sequential reads alternating between two addresses will cause a cache miss on every read. Cache invalidation is among the two hard problems in Comp Sci (the other being 'naming things') and out of the scope of this exact question.
I would probably implement a search that checks for the correct slot to act on for each of these. Then another block would pick the first from that list and act on it.
Now, getting back to the question. What are the conditions under which duplicates could possibly enter the cache. If memory accesses are strictly ordered, and the implementation (as above) is correct, I don't think duplicates are possible at all. And thus there's no need to check for them.
Now lets consider a more implausible case where A single cache is shared across two CPU cores. We're going to just do the simplest thing that could work and duplicate everything except the cache memory itself for each core. Thus the slot searching hardware is not shared. To support this, an extra bit per slot is used as a mutex. search hardware cannot use a slot that is locked by the other core. specifically,
If the address is in the cache, try to lock the slot and return that slot. If the slot is already locked, stall until it is free.
If the address is not in the cache, find an unlocked slot that is invalid or valid but evictable.
in this case we actually can end up in a position where two slots share the same address. If both cores try to write to an address that is not in the cache, they will end up getting different slots, and a duplicate line will occur. First lets think about what could happen:
Both lines were reads from main memory. They will be the same value and they will both be clean. It is correct to evict either.
Both lines were writes. Both will be dirty, but probably not be equal. This is a race condition that should have been resolved by the application by issuing memory fences or some other memory ordering instructions. We cannot guess which one should be used, if there was no cache the race condition would persist into RAM. It is correct to evict either.
One line was a read and one was a write. The write is dirty but the read is clean. Once again this race condition would have persisted into RAM if there was no intervening cache, but the reader could have seen a different value. evicting the clean line is right by RAM, and also has the side effect of always favoring read then write ordering.
So now we know what to do about it, but where does this logic belong. First lets think about what could happen if we don't do anything. A subsequent cache access for the same address on either core could return either line. Even if neither core is issuing writes, reads could keep coming up different, alternating between the two values. This breaks every conceivable idea about memory ordering.
one solution might be to just say that dirty lines belong to one core only, the line is not dirty, but dirty and owned by another core.
In the case of two concurrent reads, both lines are identical, unlocked and interchangeable. It doesn't matter which line a core gets for subsequent operations.
in the case of concurrent writes, both lines are out of sync, but mutually invisible. Although the race condition that this creates is unfortunate, it still leads to a reasonable memory ordering, as if all of the operations that happen on the discarded line happened before any of the operations on the cleaned line.
If a read and a write happen concurrently, the dirty line is invisible to the reading core. However, the clean line is visible to both cores, and would cause memory ordering to break down for the writer. future writes could even cause it to lock both (because both would be dirty).
That last case pretty much militates that dirty lines be preferred to clean ones. This forces at least some extra hardware to look for dirty lines first and clean lines only if no dirty lines were found. So now we have a new concurrent cache implementation:
If the address is in the cache and dirty and owned by the requesting core, use that slot
if the address is in the cache but clean
for reads, just use that slot
for writes, mark the slot as dirty and use that slot
if the address is not in the cache and there are invalid slots, use an invalid slot
if there are no invalid slots, evict a line and use that slot.
We're getting closer, there's still a hole in the implementation. What if both cores access the same address but not concurrently. The simplest thing is probably to just say that dirty lines are really invisible to other cores. In cache but dirty is the same as not being in the cache at all.
Now all we have to think about is actually providing the tool for applications to synchronize. I'd probably do a tool that just explicitly flushes a line if it is dirty. This would just invoke the same hardware that is used during eviction, but marks the line as clean instead of invalid.
To make a long post short, the idea is to deal with the duplicates not by removing them, but by making sure they cannot lead to further memory ordering issues, and leaving the deduplication work to the application or eventual eviction.