Couchbase created huge GSI index files - couchbase

We faced an issue with Couchbase where it generated a 150 GB secondary index file for a bucket that holds only 26.4 GB of data.
We got the following error during the night:
2018-09-13T22:47:18.552Z+05:30 [Error] IndexerSettingsManager: metakv notifier failed (unexpected EOF)..Restarting
2018/09/13 22:47:18 revrpc: Got error (EOF) and will retry in 1s
panic: 2018-09-13T22:47:24.896Z+05:30 [Error] IndexerSettingsManager: metakv notifier failed (Get http://127.0.0.1:8091/_metakv/?feed=continuous: CBAuth database is stale: last reason: EOF)..Restarting
CBAuth database is stale: last reason: EOF
After that, the indexer was only publishing these logs:
2018-09-14T03:53:45.284Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8340 Milliseconds Vbucket 984
2018-09-14T03:53:45.314Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8370 Milliseconds Vbucket 984
2018-09-14T03:53:45.344Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8400 Milliseconds Vbucket 984
2018-09-14T03:53:45.374Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8430 Milliseconds Vbucket 984
2018-09-14T03:53:45.404Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8460 Milliseconds Vbucket 984
2018-09-14T03:53:45.434Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8490 Milliseconds Vbucket 984
2018-09-14T03:53:45.464Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8520 Milliseconds Vbucket 984
2018-09-14T03:53:45.494Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8550 Milliseconds Vbucket 984
2018-09-14T03:53:45.524Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8580 Milliseconds Vbucket 984
2018-09-14T03:53:45.554Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8610 Milliseconds Vbucket 984
2018-09-14T03:53:45.584Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8640 Milliseconds Vbucket 984
2018-09-14T03:53:45.614Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8670 Milliseconds Vbucket 984
2018-09-14T03:53:45.644Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8700 Milliseconds Vbucket 984
2018-09-14T03:53:45.674Z+05:30 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 8730 Milliseconds Vbucket 984
2018-09-14T03:53:45.7Z+05:30 [Info] logWriterStat:: 5308082947943612037 FlushedCount 33730000 QueuedCount 7840
However, it was apparently still writing the index file in the background, and it created the following files, which filled the disk to 100%.
sudo ls -lrth /storage/1/couchbase/index/@2i/myBucket_idx_5308082947943612037_0.index
total 181G
-rw-rw---- 1 couchbase couchbase 31G Sep 14 03:53 data.fdb.418
-rw-rw---- 1 couchbase couchbase 150G Sep 14 03:59 data.fdb.417
Couchbase Community Edition version: 4.0.0-4051-1
Please let us know how to debug this further. What could be the cause?
I see a few issues related to this, but they were not of much help: MB-19145, MB-20464, MB-18912, MB-14962, MB-25086
Edit:
The indexes on this keyspace:
"indexes": {
"datastore_id": "http://127.0.0.1:8091",
"id": "40d3a82d781dad05",
"index_key": [],
"is_primary": true,
"keyspace_id": "myTable",
"name": "#primary",
"namespace_id": "default",
"state": "online",
"using": "gsi"
},
{
"indexes": {
"datastore_id": "http://127.0.0.1:8091",
"id": "b870f85dc1cadd5",
"index_key": [
"(meta().`id`)"
],
"keyspace_id": "myTable",
"name": "idx",
"namespace_id": "default",
"state": "online",
"using": "gsi"
}

Related

Unable to start nginx-ingress-controller: Readiness and Liveness probes failed

I have installed it using the instructions at this link, following the "Install NGINX using NodePort" option.
When I run ks logs -f ingress-nginx-controller-7f48b8-s7pg4 -n ingress-nginx I get:
W0304 09:33:40.568799 8 client_config.go:614] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0304 09:33:40.569097 8 main.go:241] "Creating API client" host="https://10.96.0.1:443"
I0304 09:33:40.584904 8 main.go:285] "Running in Kubernetes cluster" major="1" minor="23" git="v1.23.1+k0s" state="clean" commit="b230d3e4b9d6bf4b731d96116a6643786e16ac3f" platform="linux/amd64"
I0304 09:33:40.911443 8 main.go:105] "SSL fake certificate created" file="/etc/ingress-controller/ssl/default-fake-certificate.pem"
I0304 09:33:40.916404 8 main.go:115] "Enabling new Ingress features available since Kubernetes v1.18"
W0304 09:33:40.918137 8 main.go:127] No IngressClass resource with name nginx found. Only annotation will be used.
I0304 09:33:40.942282 8 ssl.go:532] "loading tls certificate" path="/usr/local/certificates/cert" key="/usr/local/certificates/key"
I0304 09:33:40.977766 8 nginx.go:254] "Starting NGINX Ingress controller"
I0304 09:33:41.007616 8 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"ingress-nginx-controller", UID:"1a4482d2-86cb-44f3-8ebb-d6342561892f", APIVersion:"v1", ResourceVersion:"987560", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/ingress-nginx-controller
E0304 09:33:42.087113 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
E0304 09:33:43.041954 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
E0304 09:33:44.724681 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
E0304 09:33:48.303789 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
E0304 09:33:59.113203 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
E0304 09:34:16.727052 8 reflector.go:138] k8s.io/client-go@v0.20.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource
I0304 09:34:39.216165 8 main.go:187] "Received SIGTERM, shutting down"
I0304 09:34:39.216773 8 nginx.go:372] "Shutting down controller queues"
E0304 09:34:39.217779 8 store.go:178] timed out waiting for caches to sync
I0304 09:34:39.217856 8 nginx.go:296] "Starting NGINX process"
I0304 09:34:39.218007 8 leaderelection.go:243] attempting to acquire leader lease ingress-nginx/ingress-controller-leader-nginx...
I0304 09:34:39.219741 8 queue.go:78] "queue has been shutdown, failed to enqueue" key="&ObjectMeta{Name:initial-sync,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},}"
I0304 09:34:39.219787 8 nginx.go:316] "Starting validation webhook" address=":8443" certPath="/usr/local/certificates/cert" keyPath="/usr/local/certificates/key"
I0304 09:34:39.242501 8 leaderelection.go:253] successfully acquired lease ingress-nginx/ingress-controller-leader-nginx
I0304 09:34:39.242807 8 queue.go:78] "queue has been shutdown, failed to enqueue" key="&ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},}"
I0304 09:34:39.242837 8 status.go:84] "New leader elected" identity="ingress-nginx-controller-7f48b8-s7pg4"
I0304 09:34:39.252025 8 status.go:204] "POD is not ready" pod="ingress-nginx/ingress-nginx-controller-7f48b8-s7pg4" node="fbcdcesdn02"
I0304 09:34:39.255282 8 status.go:132] "removing value from ingress status" address=[]
I0304 09:34:39.255328 8 nginx.go:380] "Stopping admission controller"
I0304 09:34:39.255379 8 nginx.go:388] "Stopping NGINX process"
E0304 09:34:39.255664 8 nginx.go:319] "Error listening for TLS connections" err="http: Server closed"
2022/03/04 09:34:39 [notice] 43#43: signal process started
I0304 09:34:40.263361 8 nginx.go:401] "NGINX process has stopped"
I0304 09:34:40.263396 8 main.go:195] "Handled quit, awaiting Pod deletion"
I0304 09:34:50.263585 8 main.go:198] "Exiting" code=0
When I run ks describe pod ingress-nginx-controller-7f48b8-s7pg4 -n ingress-nginx I get:
Name: ingress-nginx-controller-7f48b8-s7pg4
Namespace: ingress-nginx
Priority: 0
Node: fxxxxxxxx/10.XXX.XXX.XXX
Start Time: Fri, 04 Mar 2022 08:12:57 +0200
Labels: app.kubernetes.io/component=controller
app.kubernetes.io/instance=ingress-nginx
app.kubernetes.io/name=ingress-nginx
pod-template-hash=7f48b8
Annotations: kubernetes.io/psp: 00-k0s-privileged
Status: Running
IP: 10.244.0.119
IPs:
IP: 10.244.0.119
Controlled By: ReplicaSet/ingress-nginx-controller-7f48b8
Containers:
controller:
Container ID: containerd://638ff4d63b7ba566125bd6789d48db6e8149b06cbd9d887ecc57d08448ba1d7e
Image: k8s.gcr.io/ingress-nginx/controller:v0.48.1@sha256:e9fb216ace49dfa4a5983b183067e97496e7a8b307d2093f4278cd550c303899
Image ID: k8s.gcr.io/ingress-nginx/controller@sha256:e9fb216ace49dfa4a5983b183067e97496e7a8b307d2093f4278cd550c303899
Ports: 80/TCP, 443/TCP, 8443/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
/nginx-ingress-controller
--election-id=ingress-controller-leader
--ingress-class=nginx
--configmap=$(POD_NAMESPACE)/ingress-nginx-controller
--validating-webhook=:8443
--validating-webhook-certificate=/usr/local/certificates/cert
--validating-webhook-key=/usr/local/certificates/key
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 04 Mar 2022 11:33:40 +0200
Finished: Fri, 04 Mar 2022 11:34:50 +0200
Ready: False
Restart Count: 61
Requests:
cpu: 100m
memory: 90Mi
Liveness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: ingress-nginx-controller-7f48b8-s7pg4 (v1:metadata.name)
POD_NAMESPACE: ingress-nginx (v1:metadata.namespace)
LD_PRELOAD: /usr/local/lib/libmimalloc.so
Mounts:
/usr/local/certificates/ from webhook-cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zvcnr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
webhook-cert:
Type: Secret (a volume populated by a Secret)
SecretName: ingress-nginx-admission
Optional: false
kube-api-access-zvcnr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 23m (x316 over 178m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 8m52s (x555 over 174m) kubelet Back-off restarting failed container
Normal Pulled 3m54s (x51 over 178m) kubelet Container image "k8s.gcr.io/ingress-nginx/controller:v0.48.1@sha256:e9fb216ace49dfa4a5983b183067e97496e7a8b307d2093f4278cd550c303899" already present on machine
When I try to curl the health endpoints, I get Connection refused.
The state of the pods shows that they are both not ready:
NAME READY STATUS RESTARTS AGE
ingress-nginx-admission-create-4hzzk 0/1 Completed 0 3h30m
ingress-nginx-controller-7f48b8-s7pg4 0/1 CrashLoopBackOff 63 (91s ago) 3h30m
I have tried to increase the values for initialDelaySeconds in /etc/nginx/nginx.conf, but when I attempt to exec into the container (ks exec -it -n ingress-nginx ingress-nginx-controller-7f48b8-s7pg4 -- bash) I also get an error: unable to upgrade connection: container not found ("controller")
I am not really sure where I should be looking in the overall setup.
I have installed using instructions at this link for the Install NGINX using NodePort option.
The problem is that you are using outdated k0s documentation:
https://docs.k0sproject.io/v1.22.2+k0s.1/examples/nginx-ingress/
You should use this link instead:
https://docs.k0sproject.io/main/examples/nginx-ingress/
The old instructions deploy controller v0.48.1, which still watches the networking.k8s.io/v1beta1 Ingress API; that API was removed in Kubernetes 1.22, which is why the controller keeps logging "Failed to watch *v1beta1.Ingress" and never becomes ready. By following the current documentation link, you will instead install the controller-v1.0.0 version on your Kubernetes cluster:
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.0.0/deploy/static/provider/baremetal/deploy.yaml
The result is:
$ sudo k0s kubectl get pods -n ingress-nginx
NAME READY STATUS RESTARTS AGE
ingress-nginx-admission-create-dw2f4 0/1 Completed 0 11m
ingress-nginx-admission-patch-4dmpd 0/1 Completed 0 11m
ingress-nginx-controller-75f58fbf6b-xrfxr 1/1 Running 0 11m
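To double-check that the new controller matches the cluster, you can verify which Ingress API versions the cluster serves (controller-v1.0.0 watches the stable networking.k8s.io/v1 API) and that the IngressClass from the new manifest exists. A small sketch, assuming the manifest creates the usual IngressClass named nginx:
# Ingress API resources served by the cluster
kubectl api-resources --api-group=networking.k8s.io
# IngressClass created by the controller-v1.0.0 manifest
kubectl get ingressclass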

MySQL shutdown issue on Magento

We have a Magento website.
Sometimes our website shows the error below:
There has been an error processing your request
Exception printing is disabled by default for security reasons.
Error log record number: 855613014442
Based on our logs, MySQL is going down, as shown below:
2019-06-24T04:44:49.542168Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.26' socket: '/var/lib/mysql/mysql.sock' port: 3306 MySQL Community Server (GPL)
2019-06-24T04:44:50.594943Z 0 [Note] InnoDB: Buffer pool(s) load completed at 190624 4:44:50
2019-06-24T04:45:11.103402Z 0 [Note] Giving 0 client threads a chance to die gracefully
2019-06-24T04:45:11.103429Z 0 [Note] Shutting down slave threads
2019-06-24T04:45:11.103438Z 0 [Note] Forcefully disconnecting 0 remaining clients
2019-06-24T04:45:11.103444Z 0 [Note] Event Scheduler: Purging the queue. 0 events
2019-06-24T04:45:11.103484Z 0 [Note] Binlog end
We have increased innodb_buffer_pool_size, but I am still facing the same issue.
I have executed the commands below on my server; here are the outputs:
1) free -m
Output:
total used free shared buff/cache available
Mem: 7819 1430 4688 81 1701 6009
Swap: 0 0 0
2) dmesg | tail -30
Output:
[ 6.222373] [TTM] Initializing pool allocator
[ 6.241079] [TTM] Initializing DMA pool allocator
[ 6.255768] [drm] fb mappable at 0xF0000000
[ 6.259225] [drm] vram aper at 0xF0000000
[ 6.262574] [drm] size 33554432
[ 6.265475] [drm] fb depth is 24
[ 6.268473] [drm] pitch is 3072
[ 6.289079] fbcon: cirrusdrmfb (fb0) is primary device
[ 6.346169] Console: switching to colour frame buffer device 128x48
[ 6.347151] loop: module loaded
[ 6.357709] cirrus 0000:00:02.0: fb0: cirrusdrmfb frame buffer device
[ 6.364646] [drm] Initialized cirrus 1.0.0 20110418 for 0000:00:02.0 on minor 0
[ 6.722341] input: PC Speaker as /devices/platform/pcspkr/input/input4
[ 6.788110] EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
[ 6.802845] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 6.841332] cryptd: max_cpu_qlen set to 1000
[ 6.871200] AVX2 version of gcm_enc/dec engaged.
[ 6.873349] AES CTR mode by8 optimization enabled
[ 6.936609] EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
[ 6.949717] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 6.964446] alg: No test for __gcm-aes-aesni (__driver-gcm-aes-aesni)
[ 6.984659] alg: No test for __generic-gcm-aes-aesni (__driver-generic-gcm-aes-aesni)
[ 7.084148] intel_rapl: Found RAPL domain package
[ 7.086591] intel_rapl: Found RAPL domain dram
[ 7.088788] intel_rapl: DRAM domain energy unit 15300pj
[ 7.102115] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 7.102119] EDAC sbridge: Ver: 1.1.2
[ 7.175339] ppdev: user-space parallel port driver
[ 10.728980] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 10.772307] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
3) ps auxw | grep mysql
Output:
mysql 5056 2.9 10.8 7009056 871240 ? Sl 12:29 0:12 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
root 5538 0.0 0.0 112708 976 pts/0 S+ 12:36 0:00 grep --color=auto mysql
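To narrow down why mysqld keeps stopping, it may also help to confirm the buffer pool value the running server actually picked up and to look for an external kill or stop around the shutdown time. A rough sketch (the log path and the service name are assumptions for a typical CentOS/RHEL host running MySQL 5.7):
# Buffer pool size actually in effect (in bytes)
mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
# Any kernel OOM kills around the incident?
grep -iE 'out of memory|killed process' /var/log/messages
# Did systemd (or something else) stop the service at that time?
sudo journalctl -u mysqld --since "2019-06-24 04:40" --until "2019-06-24 04:50"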
Does anyone have an idea how to resolve this issue?
Thanks

Could not initialize corosync configuration API error 12

Unable to initialize corosync running inside a docker container. The corosync-cfgtool -s command yields the following:
Could not initialize corosync configuration API error 12
The /etc/corosync/corosync.conf file has the following:
compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 127.0.0.1
        mcastaddr: 239.255.1.1
        mcastport: 5405
        ttl: 1
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
The /var/log/corosync.log file shows the following:
May 02 20:13:22 corosync [MAIN ] Could not set SCHED_RR at priority 99: Operation not permitted (1)
May 02 20:13:22 corosync [MAIN ] Could not lock memory of service to avoid page faults: Cannot allocate memory (12)
May 02 20:13:22 corosync [MAIN ] Corosync Cluster Engine ('1.4.6'): started and ready to provide service.
May 02 20:13:22 corosync [MAIN ] Corosync built-in features: nss
May 02 20:13:22 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
May 02 20:13:22 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
May 02 20:13:22 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
I was running the following in a bash script:
service corosync start
service corosync status
corosync-cfgtool -s
Apparently it was running too quickly and not giving corosync enough time to initialize. Changing the script to the following seems to have worked:
service corosync start
service corosync status
sleep 5
corosync-cfgtool -s
I now see the following output from corosync-cfgtool -s:
Printing ring status.
Local node ID 16777343
RING ID 0
id = 127.0.0.1
status = ring 0 active with no faults
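If a fixed sleep feels fragile, the wait can also be written as a bounded polling loop; a sketch of the same script, assuming corosync-cfgtool exits non-zero while the cfg API is still unavailable:
service corosync start
service corosync status
# Poll for up to 30 seconds instead of sleeping a fixed amount
for i in $(seq 1 30); do
    corosync-cfgtool -s >/dev/null 2>&1 && break
    sleep 1
done
corosync-cfgtool -s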

Unable to use JMockit with OpenJDK 1.7

While trying to use JMockit (1.21) with JUnit (4.8) test cases, I ran into an issue with OpenJDK (1.7). I'm using Eclipse. After searching on SO, I found the solution of adding the '-javaagent:path/to/JMockit/jar' argument to the JVM and putting the JMockit dependency before JUnit in Maven. But after adding that argument, my tests won't run and instead I get the following error. Did anyone have this issue, and how did you solve it? It works if I use OracleJDK, but I'm looking for a solution where it runs with OpenJDK.
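As a sanity check outside Eclipse, the same agent can be attached to a plain Maven Surefire run; a sketch assuming the default Surefire configuration (which picks up the argLine property) and the same jar path that appears in the -javaagent argument of the crash dump below:
# Run the tests on the command line with the JMockit agent attached
mvn test "-DargLine=-javaagent:S:\.m2\org\jmockit\jmockit\1.21\jmockit-1.21.jar"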
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000051cfbbe8, pid=9268, tid=2272
#
# JRE version: OpenJDK Runtime Environment (7.0) (build 1.7.0-45-asmurthy_2014_01_10_19_46-b00)
# Java VM: Dynamic Code Evolution 64-Bit Server VM (24.45-b06 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# V [jvm.dll+0x6bbe8]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x000000000270c000): VMThread [stack: 0x00000000074e0000,0x00000000075e0000] [id=2272]
siginfo: ExceptionCode=0xc0000005, reading address 0x0000000000000000
Registers:
RAX=0x0000000000000000, RBX=0x0000000000000000, RCX=0x0000000000000000, RDX=0x00000007fae04720
RSP=0x00000000075df190, RBP=0x00000000523a16d8, RSI=0x0000000002abe2b0, RDI=0x00000000523a16e0
R8 =0x0000000000000000, R9 =0x0000000000000100, R10=0x0000000000041999, R11=0x0000000008eab1f0
R12=0x000000000270c000, R13=0x00000007fb208040, R14=0x0000000000000001, R15=0x00000000000003d8
RIP=0x0000000051cfbbe8, EFLAGS=0x0000000000010206
Top of Stack: (sp=0x00000000075df190)
0x00000000075df190: 00000007fb208050 0000000002abe200
0x00000000075df1a0: 00000007fb208040 000000000270c000
0x00000000075df1b0: 00000000523a16e0 0000000051e0ecbc
0x00000000075df1c0: 0000000000000000 00000000523a16d8
0x00000000075df1d0: 0000000002abe2b0 000000000270c000
0x00000000075df1e0: 00000000521ea7b8 0000000052450100
0x00000000075df1f0: 0000000000000000 00000000521ea7a0
0x00000000075df200: 0000000000001000 00000000075df1e0
0x00000000075df210: 0000000000000100 0000000000000000
0x00000000075df220: 00000000073158d8 00000000000003d8
0x00000000075df230: 00000000073158d8 0000000002701ac0
0x00000000075df240: 00000000073154f0 0000000051e74dc7
0x00000000075df250: 00000000026c3de0 0000000000000001
0x00000000075df260: 0000000002abe2b0 0000000007315500
0x00000000075df270: 0000000007315500 00000000073154f0
0x00000000075df280: 0000000002701ac0 0000000051e74072
Instructions: (pc=0x0000000051cfbbe8)
0x0000000051cfbbc8: cc cc cc cc cc cc cc cc 48 89 5c 24 08 57 48 83
0x0000000051cfbbd8: ec 20 48 8b 05 17 20 69 00 48 8b 0d c0 e5 68 00
0x0000000051cfbbe8: 48 63 18 e8 60 c3 fa ff 33 ff 48 85 db 7e 37 66
0x0000000051cfbbf8: 0f 1f 84 00 00 00 00 00 48 8b 05 f1 1f 69 00 48
Register to memory mapping:
RAX=0x0000000000000000 is an unknown value
RBX=0x0000000000000000 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x00000007fae04720 is an oop
{instance class}
- klass: {other class}
RSP=0x00000000075df190 is an unknown value
RBP=0x00000000523a16d8 is an unknown value
RSI=0x0000000002abe2b0 is pointing into the stack for thread: 0x00000000026c7000
RDI=0x00000000523a16e0 is an unknown value
R8 =0x0000000000000000 is an unknown value
R9 =0x0000000000000100 is an unknown value
R10=0x0000000000041999 is an unknown value
R11=0x0000000008eab1f0 is an unknown value
R12=0x000000000270c000 is an unknown value
R13=0x00000007fb208040 is an oop
{instance class}
- klass: {other class}
R14=0x0000000000000001 is an unknown value
R15=0x00000000000003d8 is an unknown value
Stack: [0x00000000074e0000,0x00000000075e0000], sp=0x00000000075df190, free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [jvm.dll+0x6bbe8]
VM_Operation (0x0000000002abe2b0): RedefineClasses, mode: safepoint, requested by thread 0x00000000026c7000
--------------- P R O C E S S ---------------
Java Threads: ( => current thread )
0x000000000739c800 JavaThread "Attach Listener" daemon [_thread_blocked, id=6668, stack(0x0000000007c60000,0x0000000007d60000)]
0x000000000739b000 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=5620, stack(0x0000000007aa0000,0x0000000007ba0000)]
0x0000000007379800 JavaThread "Finalizer" daemon [_thread_blocked, id=9832, stack(0x00000000078c0000,0x00000000079c0000)]
0x0000000007370000 JavaThread "Reference Handler" daemon [_thread_blocked, id=7516, stack(0x0000000007680000,0x0000000007780000)]
0x00000000026c7000 JavaThread "main" [_thread_blocked, id=6580, stack(0x00000000029c0000,0x0000000002ac0000)]
Other Threads:
=>0x000000000270c000 VMThread [stack: 0x00000000074e0000,0x00000000075e0000] [id=2272]
VM state:at safepoint (normal execution)
VM Mutex/Monitor currently owned by a thread: ([mutex/lock_event])
[0x00000000026c3e60] Threads_lock - owner thread: 0x000000000270c000
[0x00000000026c4360] Heap_lock - owner thread: 0x00000000026c7000
[0x00000000026c4b60] RedefineClasses_lock - owner thread: 0x00000000026c7000
Heap
def new generation total 118016K, used 21035K [0x000000067ae00000, 0x0000000682e00000, 0x00000006fae00000)
eden space 104960K, 20% used [0x000000067ae00000, 0x000000067c28af18, 0x0000000681480000)
from space 13056K, 0% used [0x0000000681480000, 0x0000000681480000, 0x0000000682140000)
to space 13056K, 0% used [0x0000000682140000, 0x0000000682140000, 0x0000000682e00000)
tenured generation total 262144K, used 0K [0x00000006fae00000, 0x000000070ae00000, 0x00000007fae00000)
the space 262144K, 0% used [0x00000006fae00000, 0x00000006fae00000, 0x00000006fae00200, 0x000000070ae00000)
compacting perm gen total 21248K, used 4129K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
the space 21248K, 19% used [0x00000007fae00000, 0x00000007fb208478, 0x00000007fb208600, 0x00000007fc2c0000)
No shared spaces configured.
Card table byte_map: [0x0000000005e60000,0x0000000006a90000] byte_map_base: 0x0000000002a89000
Polling page: 0x0000000000150000
Code Cache [0x0000000002da0000, 0x0000000003010000, 0x0000000005da0000)
total_blobs=187 nmethods=0 adapters=156 free_code_cache=48761Kb largest_free_block=49932032
Compilation events (0 events):
No events
GC Heap History (0 events):
No events
Deoptimization events (0 events):
No events
Internal exceptions (10 events):
Event: 1.210 Thread 0x00000000026c7000 Threw 0x000000067c036fa0 at C:\openjdk\jdk7u\hotspot\src\share\vm\interpreter\interpreterRuntime.cpp:347
Event: 1.211 Thread 0x00000000026c7000 Threw 0x000000067c03ae78 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.212 Thread 0x00000000026c7000 Threw 0x000000067c045f10 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.213 Thread 0x00000000026c7000 Threw 0x000000067c057398 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.214 Thread 0x00000000026c7000 Threw 0x000000067c066510 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.215 Thread 0x00000000026c7000 Threw 0x000000067c072ac0 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.216 Thread 0x00000000026c7000 Threw 0x000000067c084678 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.217 Thread 0x00000000026c7000 Threw 0x000000067c08c7b0 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.218 Thread 0x00000000026c7000 Threw 0x000000067c0934a8 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Event: 1.218 Thread 0x00000000026c7000 Threw 0x000000067c09cac8 at C:\openjdk\jdk7u\hotspot\src\share\vm\prims\jvm.cpp:1244
Events (10 events):
Event: 1.215 loading class 0x0000000008956520 done
Event: 1.216 loading class 0x0000000008959de0
Event: 1.216 loading class 0x0000000008959de0 done
Event: 1.217 loading class 0x0000000008ea91d0
Event: 1.217 loading class 0x0000000008ea91d0 done
Event: 1.218 loading class 0x0000000008d643e0
Event: 1.218 loading class 0x0000000008d643e0 done
Event: 1.218 loading class 0x0000000008958d20
Event: 1.218 loading class 0x0000000008958d20 done
Event: 1.219 Executing VM operation: RedefineClasses
Dynamic libraries:
0x000000013f7c0000 - 0x000000013f7f1000 S:\OpenJDK\bin\javaw.exe
0x0000000076ef0000 - 0x0000000077099000 C:\WINDOWS\SYSTEM32\ntdll.dll
0x0000000076cb0000 - 0x0000000076dcf000 C:\WINDOWS\system32\kernel32.dll
0x000007fefcde0000 - 0x000007fefce4b000 C:\WINDOWS\system32\KERNELBASE.dll
0x0000000074990000 - 0x0000000074a19000 C:\WINDOWS\System32\SYSFER.DLL
0x000007feff0f0000 - 0x000007feff1cb000 C:\WINDOWS\system32\ADVAPI32.dll
0x000007fefed40000 - 0x000007fefeddf000 C:\WINDOWS\system32\msvcrt.dll
0x000007fefe660000 - 0x000007fefe67f000 C:\WINDOWS\SYSTEM32\sechost.dll
0x000007fefefc0000 - 0x000007feff0ed000 C:\WINDOWS\system32\RPCRT4.dll
0x0000000076df0000 - 0x0000000076eea000 C:\WINDOWS\system32\USER32.dll
0x000007fefe270000 - 0x000007fefe2d7000 C:\WINDOWS\system32\GDI32.dll
0x000007fefe360000 - 0x000007fefe36e000 C:\WINDOWS\system32\LPK.dll
0x000007fefe370000 - 0x000007fefe43a000 C:\WINDOWS\system32\USP10.dll
0x000007fefb460000 - 0x000007fefb654000 C:\WINDOWS\WinSxS\amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.7601.18837_none_fa3b1e3d17594757\COMCTL32.dll
0x000007fefee30000 - 0x000007fefeea1000 C:\WINDOWS\system32\SHLWAPI.dll
0x000007fefe630000 - 0x000007fefe65e000 C:\WINDOWS\system32\IMM32.DLL
0x000007fefeeb0000 - 0x000007fefefb9000 C:\WINDOWS\system32\MSCTF.dll
0x0000000052420000 - 0x00000000524f2000 S:\OpenJDK\jre\bin\msvcr100.dll
0x0000000051c90000 - 0x000000005241e000 S:\OpenJDK\jre\bin\server\jvm.dll
0x000007fef82b0000 - 0x000007fef82b9000 C:\WINDOWS\system32\WSOCK32.dll
0x000007fefede0000 - 0x000007fefee2d000 C:\WINDOWS\system32\WS2_32.dll
0x000007feff1d0000 - 0x000007feff1d8000 C:\WINDOWS\system32\NSI.dll
0x000007fefaca0000 - 0x000007fefacdb000 C:\WINDOWS\system32\WINMM.dll
0x00000000770c0000 - 0x00000000770c7000 C:\WINDOWS\system32\PSAPI.DLL
0x000007feece20000 - 0x000007feece2f000 S:\OpenJDK\jre\bin\verify.dll
0x000007fee0320000 - 0x000007fee0348000 S:\OpenJDK\jre\bin\java.dll
0x000007fee6ee0000 - 0x000007fee6f03000 S:\OpenJDK\jre\bin\instrument.dll
0x000007fee06e0000 - 0x000007fee06f5000 S:\OpenJDK\jre\bin\zip.dll
0x000007fef1f90000 - 0x000007fef20b5000 C:\WINDOWS\system32\dbghelp.dll
VM Arguments:
jvm_args: -javaagent:S:\.m2\org\jmockit\jmockit\1.21\jmockit-1.21.jar -Dfile.encoding=ISO-8859-1
java_command: org.eclipse.jdt.internal.junit.runner.RemoteTestRunner -version 3 -port 61294 -testLoaderClass org.eclipse.jdt.internal.junit4.runner.JUnit4TestLoader -loaderpluginname org.eclipse.jdt.junit4.runtime -classNames com.examples.JMockitTest
Launcher Type: SUN_STANDARD
Environment Variables:
JRE_HOME=C:\Program Files (x86)\IBM\RationalSDLC\Common\Java5.0\jre
PATH=C:\Python27\;C:\Python27\Scripts;C:\Program Files (x86)\IBM\RationalSDLC\Clearquest\cqcli\bin;C:\PERL51001\Perl\site\bin;C:\PERL51001\Perl\bin;C:\Program Files (x86)\RSA SecurID Token Common;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program Files (x86)\Microsoft Application Virtualization Client;C:\Program Files (x86)\java\jre6\bin\;C:\Perl64\bin;C:\Program Files (x86)\Perforce;C:\Program Files (x86)\IBM\RationalSDLC\ClearCase\bin;C:\Program Files (x86)\IBM\RationalSDLC\common;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files\TortoiseGit\bin;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;C:\Program Files\Microsoft SQL Server\120\Tools\Binn\;C:\Program Files\nodejs\
Thanks

Simple YARN benchmark TestDFSIO fails

I've set up Hadoop on a two-node cluster. The first node, "namenode", runs the following daemons:
hadoop@namenode:~$ jps
2916 SecondaryNameNode
2692 NameNode
3159 NodeManager
5834 Jps
2771 DataNode
3076 ResourceManager
The second node, "datanode", runs the following daemons:
hadoop@datanode:~$ jps
2559 Jps
2087 DataNode
2198 NodeManager
In the /etc/hosts file on BOTH machines I added:
10.240.40.246 namenode
10.240.172.201 datanode
which are the corresponding IPs, and I checked that I can ssh from each machine to the other. Now I wanted to test my Hadoop installation by running a sample MapReduce benchmark job:
hadoop@namenode:~$ hadoop jar /opt/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 20 -fileSize 10
However, this job fails:
14/02/17 22:22:53 INFO fs.TestDFSIO: TestDFSIO.1.7
14/02/17 22:22:53 INFO fs.TestDFSIO: nrFiles = 20
14/02/17 22:22:53 INFO fs.TestDFSIO: nrBytes (MB) = 10.0
14/02/17 22:22:53 INFO fs.TestDFSIO: bufferSize = 1000000
14/02/17 22:22:53 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
14/02/17 22:22:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/02/17 22:22:55 INFO fs.TestDFSIO: creating control file: 10485760 bytes, 20 files
14/02/17 22:22:56 INFO fs.TestDFSIO: created control files for: 20 files
14/02/17 22:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:57 INFO mapred.FileInputFormat: Total input paths to process : 20
14/02/17 22:22:57 INFO mapreduce.JobSubmitter: number of splits:20
14/02/17 22:22:57 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/02/17 22:22:57 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/02/17 22:22:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1392675199090_0001
14/02/17 22:22:59 INFO impl.YarnClientImpl: Submitted application application_1392675199090_0001 to ResourceManager at /0.0.0.0:8032
14/02/17 22:22:59 INFO mapreduce.Job: The url to track the job: http://namenode.c.forward-camera-473.internal:8088/proxy/application_1392675199090_0001/
14/02/17 22:22:59 INFO mapreduce.Job: Running job: job_1392675199090_0001
14/02/17 22:23:10 INFO mapreduce.Job: Job job_1392675199090_0001 running in uber mode : false
14/02/17 22:23:10 INFO mapreduce.Job: map 0% reduce 0%
14/02/17 22:23:42 INFO mapreduce.Job: map 20% reduce 0%
14/02/17 22:23:43 INFO mapreduce.Job: map 30% reduce 0%
14/02/17 22:24:14 INFO mapreduce.Job: map 60% reduce 0%
14/02/17 22:24:41 INFO mapreduce.Job: map 60% reduce 20%
14/02/17 22:24:45 INFO mapreduce.Job: map 85% reduce 20%
14/02/17 22:24:48 INFO mapreduce.Job: map 85% reduce 28%
14/02/17 22:24:59 INFO mapreduce.Job: map 90% reduce 28%
14/02/17 22:25:00 INFO mapreduce.Job: map 90% reduce 30%
14/02/17 22:25:02 INFO mapreduce.Job: map 100% reduce 30%
14/02/17 22:25:03 INFO mapreduce.Job: map 100% reduce 100%
14/02/17 22:25:16 INFO mapreduce.Job: map 0% reduce 0%
14/02/17 22:25:16 INFO mapreduce.Job: Job job_1392675199090_0001 failed with state FAILED due to: Application application_1392675199090_0001 failed 2 times due to AM Container for appattempt_1392675199090_0001_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application.
14/02/17 22:25:16 INFO mapreduce.Job: Counters: 0
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.hadoop.fs.TestDFSIO.runIOTest(TestDFSIO.java:443)
at org.apache.hadoop.fs.TestDFSIO.writeTest(TestDFSIO.java:425)
at org.apache.hadoop.fs.TestDFSIO.run(TestDFSIO.java:755)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.TestDFSIO.main(TestDFSIO.java:650)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:115)
at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Having a look into the log file on the datanode machine, I find:
hadoop@datanode:/opt/hadoop-2.2.0/logs$ cat yarn-hadoop-nodemanager-datanode.log
...
2014-02-17 22:29:33,432 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
On my namenode I did:
hadoop@namenode:/opt/hadoop-2.2.0/logs$ cat yarn-hadoop-*log
2014-02-17 22:13:20,833 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
...
2014-02-17 22:13:25,240 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
...
2014-02-17 22:13:25,505 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: NodeManager configured with 8 G physical memory allocated to containers, which is more than 80% of the total physical memory available (3.6 G). Thrashing might happen.
...
2014-02-17 22:24:48,779 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1392675199090_0001_01_000023
2014-02-17 22:24:48,779 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1392675199090_0001_01_000024
...
2014-02-17 22:25:15,733 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1392675199090_0001_02_000001 is : 1
2014-02-17 22:25:15,734 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1392675199090_0001_02_000001 and exit code: 1
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
...
2014-02-17 22:25:15,736 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
...
2014-02-17 22:25:15,751 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1392675199090_0001 CONTAINERID=container_1392675199090_0001_02_000001
...
2014-02-17 22:13:19,150 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
...
2014-02-17 22:25:15,837 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1392675199090_0001 failed 2 times due to AM Container for appattempt_1392675199090_0001_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
.Failing this attempt.. Failing the application. APPID=application_1392675199090_0001
However, I checked on the namenode machine that port 8031 is listening. I get:
hadoop@namenode:~$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 namenode.c.forwar:36975 metadata.google.in:http TIME_WAIT
tcp 0 0 namenode.c.forwar:36969 metadata.google.in:http TIME_WAIT
tcp 0 0 namenode.c.forwar:40616 namenode.c.forwar:10001 TIME_WAIT
tcp 0 0 namenode.c.forwar:36974 metadata.google.in:http ESTABLISHED
tcp 0 0 namenode.c.forward:8031 namenode.c.forwar:41229 ESTABLISHED
tcp 0 352 namenode.c.forward-:ssh e178064245.adsl.a:64305 ESTABLISHED
tcp 0 0 namenode.c.forwar:41229 namenode.c.forward:8031 ESTABLISHED
tcp 0 0 namenode.c.forwar:40365 namenode.c.forwar:10001 ESTABLISHED
tcp 0 0 namenode.c.forwar:10001 namenode.c.forwar:40365 ESTABLISHED
tcp 0 0 namenode.c.forwar:10001 datanode:48786 ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
unix 10 [ ] DGRAM 4604 /dev/log
unix 2 [ ] STREAM CONNECTED 10490
unix 2 [ ] STREAM CONNECTED 10488
unix 2 [ ] STREAM CONNECTED 10452
unix 2 [ ] STREAM CONNECTED 8452
unix 2 [ ] STREAM CONNECTED 7800
unix 2 [ ] STREAM CONNECTED 7797
unix 2 [ ] STREAM CONNECTED 6762
unix 2 [ ] STREAM CONNECTED 6702
unix 2 [ ] STREAM CONNECTED 6698
unix 2 [ ] STREAM CONNECTED 6208
unix 2 [ ] DGRAM 5750
unix 2 [ ] DGRAM 5737
unix 2 [ ] DGRAM 5734
unix 3 [ ] STREAM CONNECTED 5643
unix 3 [ ] STREAM CONNECTED 5642
unix 2 [ ] DGRAM 5640
unix 2 [ ] DGRAM 5192
unix 2 [ ] DGRAM 5171
unix 2 [ ] DGRAM 4889
unix 2 [ ] DGRAM 4723
unix 2 [ ] DGRAM 4663
unix 3 [ ] DGRAM 3132
unix 3 [ ] DGRAM 3131
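Note that plain netstat only shows "Active Internet connections (w/o servers)", i.e. no listening sockets; to confirm the listener on 8031 explicitly, something like the following can be used:
# -l lists listening sockets, -p shows the owning process (here the ResourceManager JVM)
sudo netstat -tlnp | grep 8031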
So, what could be the problem here? In my opinion, everything is set up fine. Why is my job failing then?
The log on the datanode says
Retrying connect to server: 0.0.0.0/0.0.0.0:8031
So it tries to connect to this port on the local machine, which is the datanode. However, the service runs on the namenode. Therefore, one has to add the following config lines to yarn-site.xml (on the datanode, where the NodeManager runs):
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>namenode:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>namenode:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>namenode:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>namenode:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>namenode:8088</value>
</property>
where namenode is an alias in /etc/hosts for the machine that runs the resource manager daemon.
Also add the same properties in the yarn-site.xml file on the namenode to ensure that these services connect to the same ports.
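After changing yarn-site.xml on both machines, restart YARN and check that the NodeManagers register with the ResourceManager; a sketch assuming HADOOP_HOME points at /opt/hadoop-2.2.0:
# Restart the YARN daemons so the new addresses take effect
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh
# On the namenode, the resource tracker should now show up as LISTEN on 8031
sudo netstat -tlnp | grep 8031
# Both NodeManagers should appear in the node list
yarn node -list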