Timeout error when reading a CSV file from Hadoop with PySpark

I want to read a CSV file from Hadoop with PySpark, using the following code:
dfcsv = spark.read.csv("hdfs://my_hadoop_cluster_ip:9000/user/root/input/test.csv")
dfcsv.printSchema()
My Hadoop cluster runs in a Docker container on my local machine and is linked to two other slave containers for the workers.
As the screenshot from my Hadoop cluster UI shows, the path is correct.
But when I submit my script with this command:
spark-submit --master spark://my_cluster_spark_ip:7077 test.py
My script gets stuck on the read, and after a few minutes I get the following error:
22/02/09 15:42:29 WARN TaskSetManager: Lost task 0.1 in stage 4.0 (TID 4) (my_slave_spark_ip executor 1): org.apache.hadoop.net.ConnectTimeoutException: Call From spark-slave1/my_slave_spark_ip to my_hadoop_cluster_ip:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=my_hadoop_cluster_ip/my_hadoop_cluster_ip:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
...
For reference, my CSV file is very small: just 3 lines and 64 KB.
Does anyone have a solution for this issue?
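For what it's worth, a quick way to check whether the worker containers can reach the NameNode at all (the timeout above is raised before any CSV parsing happens) is a plain socket probe. This is only a diagnostic sketch; my_hadoop_cluster_ip is the same placeholder used in the question:
import socket

# Run this from inside a worker container; a timeout here reproduces
# the ConnectTimeoutException without involving Spark at all.
try:
    with socket.create_connection(("my_hadoop_cluster_ip", 9000), timeout=20):
        print("NameNode port 9000 is reachable")
except OSError as exc:
    print(f"cannot reach NameNode: {exc}")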

Related

Py4JError while converting csv file to parquet using jupyter-notebook

I want to convert a CSV file to a Parquet file using a Jupyter notebook and Python 3. However, I get the following error:
Py4JJavaError Traceback (most recent call last)
Py4JJavaError: An error occurred while calling o40.parquet.
: org.apache.spark.SparkException: Job aborted.
at ...
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at ...
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
Caused by: java.net.SocketException: Connection reset by peer: socket write error
How can I resolve this, please?
Make sure you have the Hadoop binaries available and that HADOOP_HOME is set.
If not, download them first.
Then set HADOOP_HOME (and JAVA_HOME):
import os

# Point HADOOP_HOME at the extracted Hadoop distribution (on Windows,
# winutils.exe must be present under its bin\ directory).
os.environ['HADOOP_HOME'] = r"C:\hadoop-2.7.1"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_212"
Then save the file again.
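For completeness, a minimal sketch of the CSV-to-Parquet conversion itself; the paths here are placeholders, not the poster's actual files:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the CSV with a header row and let Spark infer column types.
df = spark.read.csv("input/data.csv", header=True, inferSchema=True)

# This is the write that fails with the SocketException above when
# the Hadoop binaries are missing on Windows.
df.write.mode("overwrite").parquet("output/data.parquet")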

org.apache.thrift.transport.TTransportException error while Reading large JSON file in zeppelin scala

I am trying to read a large JSON file (1.5 GB) using Zeppelin and Scala.
Zeppelin runs on Spark in local mode, installed on Ubuntu on a VM with 10 GB RAM. I have allotted 8 GB to spark.executor.memory.
My code is below:
val inputFileWeather="/home/shashi/incubator-zeppelin-master/data/ai/weather.json"
val temp=sqlContext.read.json(inputFileWeather)
I am getting the following error:
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The error you got is due to a problem running the Spark interpreter, so Zeppelin could not connect to the interpreter process.
You have to check your logs, located in /PATH/TO/ZEPPELIN/logs/*.out, to know exactly what is happening. Perhaps in the interpreter logs you will see an OOM.
I think 8 GB of executor memory on a VM with 10 GB is unreasonable (and how many executors are you starting?). You have to consider the driver memory as well.
Increase the driver memory in the Spark interpreter settings, i.e. spark.driver.memory. By default it is 1 GB.
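One mitigation worth a sketch (in PySpark, mirroring the main question; the field names below are assumptions, not the actual layout of weather.json): supplying an explicit schema lets Spark skip the schema-inference pass over the whole 1.5 GB file, which cuts both time and memory pressure on a small VM.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("large-json").getOrCreate()

# Hypothetical schema -- adjust the fields to match your JSON.
schema = StructType([
    StructField("station", StringType()),
    StructField("date", StringType()),
    StructField("temperature", DoubleType()),
])

# With an explicit schema, Spark reads the file once instead of
# scanning it first to infer types.
df = spark.read.schema(schema).json("/home/shashi/incubator-zeppelin-master/data/ai/weather.json")
df.printSchema()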

Multi nodes hadoop cluster configuration

I'm new to Hadoop clusters and am trying to deploy a multi-node cluster on Ubuntu 15.10 with ONE master and TWO slaves. After configuration, there are TWO active nodes (the two slaves). However, when I tried the Hadoop example program below,
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 3 100
I got a connection-refused error:
Job job_1459774851310_0001 failed with state FAILED due to: Application application_1459774851310_0001 failed 2 times due to Error launching appattempt_1459774851310_0001_000002. Got exception: java.net.ConnectException: Call From ubuntu/127.0.1.1 to ubuntu:36380 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
To deploy this cluster, I disabled IPv6 on all machines and edited the configuration files as follows:
In file core-site.xml:
fs.defaultFS = hdfs://master:8020
In file hdfs-site.xml:
dfs.namenode.name.dir = $HADOOP_PREFIX/namenode
dfs.datanode.data.dir = $HADOOP_PREFIX/datanode
In file yarn-site.xml:
yarn.resourcemanager.address = master:8084
yarn.resourcemanager.scheduler.address = master:8085
yarn.resourcemanager.resource-tracker.address = master:8086
yarn.resourcemanager.admin.address = master:8087
yarn.resourcemanager.webapp.address = master:8088
yarn.nodemanager.aux-services = mapreduce_shuffle
In file mapred-site.xml:
mapreduce.framework.name = yarn
mapreduce.jobhistory.address = master:10020
mapreduce.jobhistory.webapp.address = master:19888
Those four files are the same on all machines. In the slaves file, I wrote only the IP addresses of the two slaves.
Where did I make a mistake, and how can I fix it?
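As a hedged diagnostic sketch (not from the original thread): the "Call From ubuntu/127.0.1.1" in the error often means /etc/hosts maps the hostname to 127.0.1.1, so the slaves resolve "master" to an unreachable loopback address. A quick probe from each node shows what the names actually resolve to and which ports answer; the host/port pairs below come from the configuration above:
import socket

# Probe the addresses from core-site.xml and yarn-site.xml.
targets = [("master", 8020), ("master", 8084), ("master", 8088)]

for host, port in targets:
    try:
        addr = socket.gethostbyname(host)  # a 127.x.x.x result here is the red flag
        with socket.create_connection((host, port), timeout=5):
            print(f"{host} ({addr}):{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} FAILED: {exc}")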

My account on Cosmos global instance seems to be running out of space - maybe need to increase quota

Running a simple HDFS query failed with:
[ms#cosmosmaster-gi ~]$ hadoop fs -ls /user/ms/def_serv/def_servpath
Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file:
/tmp/hsperfdata_ms/21066
Try using the -Djava.io.tmpdir= option to select an alternate temp location.
Exception in thread "main" java.lang.NoClassDefFoundError: ___/tmp/hsperfdata_ms/21078
Caused by: java.lang.ClassNotFoundException: ___.tmp.hsperfdata_ms.21078
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: ___/tmp/hsperfdata_ms/21078. Program will exit.
Any idea how to fix that or increase the quota?
Thanks!
ms
Your quota has not been exceeded (see the command below); this was a problem with the cluster. It should be fixed now.
$ hadoop fs -dus /user/ms
hdfs://cosmosmaster-gi/user/ms 90731
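As an aside (a hedged sketch, not from the thread): the "Insufficient space for shared memory file" warning points at a full /tmp on the client machine rather than an HDFS quota, and that is easy to check directly:
import shutil

# The JVM writes hsperfdata files under /tmp by default; if /tmp is
# full, you get the shared-memory warning shown in the question.
usage = shutil.disk_usage("/tmp")
print(f"/tmp: {usage.free / 2**20:.1f} MiB free of {usage.total / 2**20:.1f} MiB")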

Jenkins/Sonar - new error after server restart

Quick question about sonar/jenkins integration.
First, a little background: we're working on implementing the build pipeline plugin, and last night we had an issue where one of our pipeline jobs lost its history. This took out all of our executors (even on the slaves). I tried renaming and bouncing, and that didn't work. Finally, I brought down the master, cleared everything from temp and work in Tomcat, and brought it back up. This fixed the issues with my executors.
So this morning I ran a build that runs Sonar as a post-build step. Now I am seeing this error:
[ERROR] Failed to execute goal org.codehaus.mojo:sonar-maven-plugin:2.0:sonar (default-cli) on project VendorProduct: Can not execute Sonar: The current batch process and the configured remote server do not share the same DB configuration.
[ERROR] - Batch side: jdbc:oracle:thin:@dansrzl105si.wellsfargo.com:3203/LBLDFRI1 (QMTDO / *****)
[ERROR] - Server side: check the configuration at http://lpwva3279:9000/sonar/system
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:sonar-maven-plugin:2.0:sonar (default-cli) on project VendorProduct: Can not execute Sonar
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.MojoExecutionException: Can not execute Sonar
at org.codehaus.mojo.sonar.Bootstraper.executeMojo(Bootstraper.java:118)
at org.codehaus.mojo.sonar.Bootstraper.start(Bootstraper.java:65)
at org.codehaus.mojo.sonar.SonarMojo.execute(SonarMojo.java:90)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: org.sonar.core.persistence.BadDatabaseVersion: The current batch process and the configured remote server do not share the same DB configuration.
- Batch side: jdbc:oracle:thin:@dansrzl105si.wellsfargo.com:3203/LBLDFRI1 (QMTDO / *****)
- Server side: check the configuration at http://lpwva3279:9000/sonar/system
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
I've verified the configuration against the Sonar configuration file, deleted and recreated the post-build step, and deleted and recreated the Sonar instance on the Jenkins configuration page. I'm at a loss here. Anyone have any suggestions?
Thanks!
This message is printed when a Sonar batch connects to a database that is not the same as the one configured on the Sonar Web server.
In your case, if you go to http://lpwva3279:9000/sonar/system, chances are that the DB config settings you'll find there are not the same as jdbc:oracle:thin:@dansrzl105si.wellsfargo.com:3203/LBLDFRI1 (QMTDO / *****). You should then go to your Jenkins settings and update the Sonar-related information to match what is on the Sonar Web server.
Go to http://host:port/sonar/system:
Copy value "Database URL" into Jenkins configuration ("Database URL")
Copy value "Database Login" into Jenkins configuration ("Database Login")
Copy value "Database Driver" into Jenkins configuration ("Database Driver")
And go to SONAR_HOME/conf/sonar.properties:
Copy value "Database password" into Jenkins configuration ("Database password")