Purging Dead Nodes from SGE - sungridengine

My qstat -g c output indicates that I have some dead nodes (in the 'cdsuE' state column):
CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
--------------------------------------------------------------------------------
all.q 0.11 18 0 9 37 0 10
Is there an easy way to purge or remove these nodes from the queue?
SGE is smart enough to not allocate work to them but they do clutter up various displays.

I do it the hard way:
Kill the jobs "running" or stuck on the dead nodes, then run the qconf node-removal sequence:
qconf -dattr hostgroup hostlist <nodealias> @allhosts
qconf -purge queue slots all.q@<nodealias>
qconf -dconf <nodealias>
qconf -de <nodealias>

If you just want to remove them from the queue, use:
qconf -dattr queue hostlist <nodename> all.q
or, if they're incorporated via a hostgroup:
qconf -dattr hostgroup hostlist <nodename> <hostgroup>
This does the minimum needed to get them out of the queue, but makes it easy to add them back if you manage to resurrect them later.
If there are any ghost jobs on the node, use qdel -f to get rid of them.


How many instructions need to be killed on a mispredict in a 6-stage scalar or superscalar MIPS?

I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch mispredict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure about. I think it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: the instruction after a branch always executes whether the branch is taken or not. (jal's return address is the instruction after the delay slot, so the delay-slot instruction doesn't execute twice on return.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
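The counts above reduce to simple arithmetic. This is a sketch of that arithmetic under the answer's assumptions (full pipeline with no bubbles, branch resolved at the end of X1, so 4 younger stages F D I X0); the helper insts_to_kill is a hypothetical name, not anything from the question:

```python
def insts_to_kill(stages_before_resolve, width, branch_pos, delay_slot):
    """Instructions to squash on a mispredict, assuming no bubbles.

    stages_before_resolve: stages younger than the resolving stage
                           (F D I X0 = 4 for the 6-stage F D I X0 X1 W pipe)
    width: issue width (1 = scalar, 3 = the 3-wide case)
    branch_pos: 0-based position of the branch within its fetch group
                (0 = oldest); same-group instructions after it are wrong-path
    delay_slot: True if one delay-slot instruction must survive
    """
    same_group = width - 1 - branch_pos       # younger insts in the branch's own group
    younger_stages = stages_before_resolve * width
    total = same_group + younger_stages
    return total - 1 if delay_slot else total

# Scalar, no delay slot: 4 wrong-path instructions in F D I X0
assert insts_to_kill(4, 1, 0, False) == 4
# 3-wide, branch oldest in its group: 14 (13 with a delay slot)
assert insts_to_kill(4, 3, 0, False) == 14
assert insts_to_kill(4, 3, 0, True) == 13
# 3-wide, branch youngest in its group: 12 (11 with a delay slot)
assert insts_to_kill(4, 3, 2, False) == 12
assert insts_to_kill(4, 3, 2, True) == 11
```

These match the worst/best-case figures given above; with bubbles in earlier stages, each count is an upper bound.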
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.

Cpu Metric in JobHistory logs processed with Rumen

I gathered stats for my jobs on a Hadoop cluster. I took the JobHistory logs and processed them with Rumen. In the JSON file, for each task attempt, there is a field named "cpuUsages".
Example:
"cpuUsages" : [ 6028, 3967, 3597, 3354, 3225, 3454, 3589, 4316, 42632, 102, 103, 103 ]
I need to know the unit of measurement of these numbers. Is there an official or academic reference for that?
Rumen extracts its metrics from the MR job history server, so the metrics are the same as those in the MR job history server.
You can see here that the MR job history server reports CPU usage in milliseconds. So the unit of measurement is CPU usage time in milliseconds.
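Treating the values as milliseconds, a quick sanity check on the sample array from the question:

```python
# The "cpuUsages" array from the question, interpreted as milliseconds
cpu_usages_ms = [6028, 3967, 3597, 3354, 3225, 3454, 3589, 4316, 42632, 102, 103, 103]

total_ms = sum(cpu_usages_ms)
print(f"total CPU time: {total_ms} ms = {total_ms / 1000:.3f} s")  # 74470 ms = 74.470 s
```

About 74 seconds of CPU time for the attempt, which is a plausible magnitude for a map/reduce task and supports the milliseconds interpretation.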

Distributing push notifications on multiple workers

Say you have millions of Android GCM device keys and you want to send push notifications to them from a management script. This script will take loads of time to finish, as it processes the keys from the DB as a queue.
Question: How do you make this faster? How do you send these notifications in parallel? How do you get to near-real-time push notifications?
One solution would be to instantiate X Celery workers, where each worker is responsible for an offset Y at which it starts fetching from MySQL.
Example:
Worker 1: starts at offset 0,
Worker 2: starts at offset 10,000,
Worker 3: starts at offset 20,000,
Worker 4: starts at offset 30,000,
Worker 5: starts at offset 40,000,
Worker 1: Restarts at offset 50,000,
Worker 2: Restarts at offset 60,000,
... etc
Is this a viable solution?
Create the list of tasks as a Celery group. Also, because you have to retrieve all records from the Android model, it's good to create a separate Celery task which does that in the background:
from celery import group, shared_task

@shared_task
def push_notification(offset, limit):
    for android in Android.objects.all()[offset:offset + limit]:
        pass  # send the GCM push for this device key here

@shared_task
def push_notification_to_all():
    count = Android.objects.all().count()
    limit = 100
    group(push_notification.s(offset, limit)
          for offset in range(0, count, limit))()

push_notification_to_all.delay()

Only one node owns data in a Cassandra cluster

I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found that all the data is stored on one node (in the output below, I have manually replaced the IP addresses with node numbers):
Datacenter: 105
==========
Address   Rack   Status   State    Load        Owns      Token
                                                         4
node-1    155    Up       Normal   249.89 KB   100.00%   0
node-2    155    Up       Normal   265.39 KB   0.00%     1
node-3    155    Up       Normal   262.31 KB   0.00%     2
node-4    155    Up       Normal   98.35 KB    0.00%     3
node-5    155    Up       Normal   113.58 KB   0.00%     4
and in their cassandra.yaml files I use all the default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list the non-IP-address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "node-1, node-2"
internode_compression: none
All nodes use the same seeds.
Can you tell me where I might have gone wrong in the config? Please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8, you should try using the vnodes feature. Instead of setting initial_token, uncomment # num_tokens: 256 in cassandra.yaml, and leave initial_token blank or comment it out. Then you don't have to calculate token positions. Each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means that you don't have to "rebalance" your cluster every time you add or remove nodes.
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. The assigned token determines the node's position in the ring and the range of data it stores. When you generate tokens, the aim is to spread them over the entire range from 0 to (2^127 - 1). Tokens aren't IDs as in a MySQL cluster, where you have to increment them sequentially.
There is a tool on GitHub that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of tokens. And if you want to understand the meaning of the numbers that are generated, check out this article.
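The usual token-generator arithmetic is just evenly spaced points over the partitioner's range (0 to 2^127 - 1 for the RandomPartitioner, as described above). A sketch, where balanced_tokens is a hypothetical helper and not an official tool:

```python
def balanced_tokens(num_nodes, token_range=2**127):
    """Evenly spaced initial_token values for RandomPartitioner (0 .. 2**127 - 1)."""
    return [i * token_range // num_nodes for i in range(num_nodes)]

# For the 5-node cluster in the question:
for node, token in enumerate(balanced_tokens(5), start=1):
    print(f"node-{node}: initial_token: {token}")
```

With tokens spread like this, each node owns ~20% of the ring, rather than node-1 owning everything because tokens 0..4 sit right next to each other.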
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh you'll see what replication_factor is currently set for your keyspace (I assume the answer is 1).
More details here

Perl script multi thread not running parallel

I am completely new to Perl, an absolute newbie. I am trying to develop a system which reads a database and, according to the results, generates a queue which launches another script.
HERE is the source code.
Now the script works as expected, except I've noticed that it doesn't really run the threads in parallel. Whether I use 1 thread or 50 threads, the execution time is the same; 1 thread is even faster.
When I have the script display which thread did what, I see the threads don't run at the same time, because it will do thread 1, then 2, then 3 etc.
Does anyone know what I did wrong here? Again the script itself works, just not in parallel threads.
You need to learn what semaphores actually are before you start using them. You've explicitly told the threads not to run in parallel:
my $s = Thread::Semaphore->new;
# ...
while ($queue_id_list->pending > 0) {
    $s->down;
    my $info = $queue_id_list->dequeue_nb;
    if (defined($info)) {
        my @details = split(/#/, $info);
        # my $result = system("./match_name db=user_".$details[0]." id=".$details[1]);
        # normally the script above would be launched, a PHP script run in php-cli that does some database work
        sleep(0.1);
        # print "Thread: ". threads->self->tid. " - Done user: ".$details[0]. " and addressbook id: ". $details[1]."\r\n";
        # print $queue_id_list->pending."\r\n";
    }
    $s->up;
}
You've created a semaphore $s, which by default has a count of 1. Then in the function you're trying to run, you call $s->down at the start -- which decreases the count by 1, or blocks if the count is already <1, and $s->up at the end, which increases the count by 1.
Once a thread calls down, no other threads will run until it calls up again.
You should carefully read the Thread::Semaphore docs, and probably this wikipedia article on semaphores, too.
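The pitfall is easy to demonstrate in Python (used here only because it makes the concurrency measurable; the mechanics mirror Thread::Semaphore's down/up): with a semaphore initialized to 1 wrapped around the whole work loop, only one worker is ever inside the critical section, no matter how many threads you start.

```python
import threading
import time

sem = threading.Semaphore(1)   # count of 1, like Thread::Semaphore->new
lock = threading.Lock()
current = 0                    # threads currently "working"
max_concurrent = 0             # peak concurrency observed

def worker():
    global current, max_concurrent
    sem.acquire()              # like $s->down: blocks while the count is < 1
    with lock:
        current += 1
        max_concurrent = max(max_concurrent, current)
    time.sleep(0.05)           # simulate the real work
    with lock:
        current -= 1
    sem.release()              # like $s->up

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_concurrent)  # 1: the semaphore serialized all 10 threads
```

Drop the acquire/release (or initialize the semaphore with a count equal to the desired parallelism) and the peak concurrency rises above 1, which is what the original script actually wanted.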