Py4JError while converting csv file to parquet using jupyter-notebook

I want to convert a CSV file to a parquet file using a Jupyter notebook with Python 3. However, I get the following error:
Py4JJavaError Traceback (most recent call last)
Py4JJavaError: An error occurred while calling o40.parquet.
: org.apache.spark.SparkException: Job aborted.
at ...
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at ...
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
Caused by: java.net.SocketException: Connection reset by peer: socket write error
How can I resolve this, please?

Make sure you have the Hadoop binaries available and that HADOOP_HOME is set.
If not, download them and then set HADOOP_HOME and JAVA_HOME:
import os
os.environ['HADOOP_HOME']=r"C:\hadoop-2.7.1"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_212"
Then retry writing the parquet file.
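For reference, a minimal end-to-end sketch (the file names and install paths are placeholders to adjust for your machine; on Windows the Hadoop binaries, including winutils.exe, are expected under %HADOOP_HOME%\bin):
import os

# These must be set before the SparkSession is created, or they are ignored.
os.environ['HADOOP_HOME'] = r"C:\hadoop-2.7.1"
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_212"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the csv (header/inferSchema are optional) and write it out as parquet.
df = spark.read.csv("input.csv", header=True, inferSchema=True)
df.write.parquet("output.parquet")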

Related

Timeout error when I want to read a file from Hadoop with PySpark

I want to read a CSV file from Hadoop with PySpark using the following code:
dfcsv = spark.read.csv("hdfs://my_hadoop_cluster_ip:9000/user/root/input/test.csv")
dfcsv.printSchema()
My Hadoop cluster runs in a Docker container on my local machine, linked with two other slave containers for the workers.
As shown in my Hadoop cluster UI, the path is the right one.
But when I submit my script with this command:
spark-submit --master spark://my_cluster_spark_ip:7077 test.py
My script gets stuck on the read, and after a few minutes I get the following error:
22/02/09 15:42:29 WARN TaskSetManager: Lost task 0.1 in stage 4.0 (TID 4) (my_slave_spark_ip executor 1): org.apache.hadoop.net.ConnectTimeoutException: Call From spark-slave1/my_slave_spark_ip to my_hadoop_cluster_ip:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=my_hadoop_cluster_ip/my_hadoop_cluster_ip:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
...
For reference, my CSV file is very small: just 3 lines and 64 KB.
Does anyone have a solution to this issue?

Apache Drill 1.17.0 on Windows 10 - Trouble Getting Drill to Run (Embedded Mode)

Details:
Apache Drill 1.17.0
Windows 10 64 bit
Java JDK1.8.0_241
New installation. Unable to get Apache Drill to load successfully.
Command line: c:\Users\floodb\Software\Drill\apache-drill-1.17.0\bin>drill-embedded
Error Received: Error: Failure in starting embedded Drillbit: UNSUPPORTED_OPERATION ERROR: Failure while attempting to load instance of the class of type org.apache.drill.exec.store.StoragePluginRegistry requested at path drill.exec.storage.registry.
[Error Id: 7c1b33eb-7a27-4e39-af06-5ba22e5ffae6 ] (state=,code=0)
java.sql.SQLException: Failure in starting embedded Drillbit: UNSUPPORTED_OPERATION ERROR: Failure while attempting to load instance of the class of type org.apache.drill.exec.store.StoragePluginRegistry requested at path drill.exec.storage.registry.
There is no 'hadoop_home' environment variable set (checking this was suggested by other posts on Stack Overflow).
Partial Log:
2020-02-19 15:55:42,315 [main] INFO o.a.drill.common.util.GuavaPatcher - Google's Stopwatch patched for old HBase Guava version.
2020-02-19 15:55:42,319 [main] INFO o.a.drill.common.util.GuavaPatcher - Google's Closeables patched for old HBase Guava version.
2020-02-19 15:55:42,333 [main] INFO o.a.drill.common.util.GuavaPatcher - Google's Preconditions were patched to hold new methods.
2020-02-19 15:55:42,693 [main] INFO o.a.drill.common.config.DrillConfig - Configuration and plugin file(s) identified in 32ms. Base Configuration:
- jar:file:/C:/Users/floodb/Software/Drill/apache-drill-1.17.0/jars/drill-common-1.17.0.jar!/drill-default.conf
(Bunch of log lines deleted)
2020-02-19 15:55:45,134 [main] INFO o.a.d.c.s.persistence.ScanResult - loading 22 classes for org.apache.drill.common.logical.data.LogicalOperator took 4ms
2020-02-19 15:55:45,138 [main] INFO o.a.d.c.s.persistence.ScanResult - loading 12 classes for org.apache.drill.common.logical.StoragePluginConfig took 3ms
2020-02-19 15:55:45,146 [main] INFO o.a.d.c.s.persistence.ScanResult - loading 15 classes for org.apache.drill.common.logical.FormatPluginConfig took 7ms
2020-02-19 15:55:45,179 [main] INFO o.a.drill.common.config.DrillConfig - User Error Occurred: Failure while attempting to load instance of the class of type org.apache.drill.exec.store.StoragePluginRegistry requested at path drill.exec.storage.registry. (null)
org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION ERROR: Failure while attempting to load instance of the class of type org.apache.drill.exec.store.StoragePluginRegistry requested at path drill.exec.storage.registry.
[Error Id: 7c1b33eb-7a27-4e39-af06-5ba22e5ffae6 ]
at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:637)
at org.apache.drill.common.config.DrillConfig.getInstance(DrillConfig.java:92)
at org.apache.drill.exec.server.DrillbitContext.<init>(DrillbitContext.java:113)
at org.apache.drill.exec.work.WorkManager.start(WorkManager.java:116)
at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:221)
at org.apache.drill.jdbc.impl.DrillConnectionImpl.<init>(DrillConnectionImpl.java:134)
at org.apache.drill.jdbc.impl.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:67)
at org.apache.drill.jdbc.impl.DrillFactory.newConnection(DrillFactory.java:67)
at org.apache.calcite.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:138)
at org.apache.drill.jdbc.Driver.connect(Driver.java:75)
at sqlline.DatabaseConnection.connect(DatabaseConnection.java:135)
at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:192)
at sqlline.Commands.connect(Commands.java:1364)
at sqlline.Commands.connect(Commands.java:1244)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
at sqlline.SqlLine.dispatch(SqlLine.java:730)
at sqlline.SqlLine.initArgs(SqlLine.java:410)
at sqlline.SqlLine.begin(SqlLine.java:515)
at sqlline.SqlLine.start(SqlLine.java:267)
at sqlline.SqlLine.main(SqlLine.java:206)
Caused by: java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.drill.common.config.DrillConfig.getInstance(DrillConfig.java:88)
... 22 common frames omitted
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
at org.apache.drill.exec.store.dfs.DrillFileSystem.listStatus(DrillFileSystem.java:563)
at org.apache.drill.exec.util.FileSystemUtil.listNonRecursive(FileSystemUtil.java:224)
at org.apache.drill.exec.util.FileSystemUtil.list(FileSystemUtil.java:209)
at org.apache.drill.exec.util.FileSystemUtil.listFiles(FileSystemUtil.java:104)
at org.apache.drill.exec.util.DrillFileSystemUtil.listFiles(DrillFileSystemUtil.java:86)
at org.apache.drill.exec.store.sys.store.LocalPersistentStore.getRange(LocalPersistentStore.java:121)
at org.apache.drill.exec.store.sys.BasePersistentStore.getAll(BasePersistentStore.java:27)
at org.apache.drill.exec.store.StoragePluginRegistryImpl.initPluginsSystemTable(StoragePluginRegistryImpl.java:277)
at org.apache.drill.exec.store.StoragePluginRegistryImpl.<init>(StoragePluginRegistryImpl.java:90)
... 27 common frames omitted
2020-02-19 15:55:46,199 [main] INFO o.apache.drill.exec.server.Drillbit - Shutdown completed (1018 ms).
The problem was that the 32-bit version of the Java JDK was installed. If you are having this problem, check to make sure that the 64-bit version of Java is installed.
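A quick way to verify which JVM is on the PATH is to run java -version and look for the 64-Bit marker in the banner; a small sketch (assuming java is on your PATH):
import subprocess

# `java -version` prints its banner to stderr, e.g.
# "Java HotSpot(TM) 64-Bit Server VM" on a 64-bit HotSpot JDK.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
banner = result.stderr or result.stdout
print(banner)
print("64-bit JVM" if "64-Bit" in banner else "probably a 32-bit JVM")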

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'RECEPTORS.r_name' in 'field list'

I am using the Play 2.3.8 framework to create an API that accesses MariaDB. When I run the query on the MariaDB console it works fine, but when I run it from Play I get an error saying that the field RECEPTORS.r_name is not available, which is not true.
My code is:
package models.dao

import anorm._
import models.Profile
import play.api.db.DB
import play.api.Play.current

object ProfileDAO {
  def index(r_name: String): List[Profile] = {
    DB.withConnection { implicit c =>
      val results = SQL(
        """
          | SELECT `RECEPTORS.r_name`, `RECEPTORS.pdbCode`, `LIGANDS.l_id`, `LIGANDS.l_score`
          | FROM `RECEPTORS`
          | INNER JOIN `LIGANDS`
          | WHERE `RECEPTORS.r_name` = {r_name};
        """.stripMargin).on(
        "r_name" -> r_name
      ).apply()
      results.map { row =>
        Profile(row[String]("r_name"), row[String]("pdbCode"), row[String]("l_id"), row[Double]("l_score"))
      }.force.toList
    }
  }
}
The query that I ran on the MariaDB console is:
SELECT RECEPTORS.r_name, pdbCode, l_id, l_score FROM RECEPTORS INNER JOIN LIGANDS WHERE RECEPTORS.r_name="receptor";
The error when running with Play 2.3.8 is as follows:
laeeq@optiplex:~/Desktop/Backup/Project5/cpvsAPI$ sbt -jvm-debug 9999 run
Listening for transport dt_socket at address: 9999
[info] Loading project definition from /home/laeeq/Desktop/Backup/Project5/cpvsAPI/project
[info] Set current project to cpvsAPI (in build file:/home/laeeq/Desktop/Backup/Project5/cpvsAPI/)
[info] Updating {file:/home/laeeq/Desktop/Backup/Project5/cpvsAPI/}root...
[info] Resolving jline#jline;2.11 ...
[info] Done updating.
--- (Running the application, auto-reloading is enabled) ---
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
(Server started, use Ctrl+D to stop and go back to the console...)
SLF4J: The following set of substitute loggers may have been accessed
SLF4J: during the initialization phase. Logging calls during this
SLF4J: phase were not honored. However, subsequent logging calls to these
SLF4J: loggers will work as normally expected.
SLF4J: See also http://www.slf4j.org/codes.html#substituteLogger
SLF4J: org.webjars.WebJarExtractor
[info] Compiling 1 Scala source to /home/laeeq/Desktop/Backup/Project5/cpvsAPI/target/scala-2.11/classes...
[info] play - database [default] connected at jdbc:mysql://localhost:3306/db_profile
[info] play - Application started (Dev)
[error] application -
! @766oc7b8l - Internal server error, for (GET) [/profiles/receptor] ->
play.api.Application$$anon$1: Execution exception[[MySQLSyntaxErrorException: Unknown column 'RECEPTORS.r_name' in 'field list']]
at play.api.Application$class.handleError(Application.scala:296) ~[play_2.11-2.3.8.jar:2.3.8]
at play.api.DefaultApplication.handleError(Application.scala:402) [play_2.11-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:205) [play_2.11-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:202) [play_2.11-2.3.8.jar:2.3.8]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) [scala-library-2.11.1.jar:na]
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'RECEPTORS.r_name' in 'field list'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_151]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_151]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_151]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_151]
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411) ~[mysql-connector-java-5.1.18.jar:na]
You need to quote the table and column names individually, so use:
`RECEPTORS`.`r_name`
Otherwise MySQL thinks you are referencing a single column whose name is literally RECEPTORS.r_name in some implicit table.
You need to do this for all of your quoted column references. In this particular case the quoting seems unnecessary anyway, so you could also just write RECEPTORS.r_name without any backticks.
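Applied to the query from the question, the corrected statement would look like this (a sketch; as noted, most of these backticks can simply be dropped):
SELECT `RECEPTORS`.`r_name`, `RECEPTORS`.`pdbCode`, `LIGANDS`.`l_id`, `LIGANDS`.`l_score`
FROM `RECEPTORS`
INNER JOIN `LIGANDS`
WHERE `RECEPTORS`.`r_name` = {r_name};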

My account on Cosmos global instance seems to be running out of space - maybe need to increase quota

Trying to run a simple HDFS query failed with:
[ms#cosmosmaster-gi ~]$ hadoop fs -ls /user/ms/def_serv/def_servpath
Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file:
/tmp/hsperfdata_ms/21066
Try using the -Djava.io.tmpdir= option to select an alternate temp location.
Exception in thread "main" java.lang.NoClassDefFoundError: ___/tmp/hsperfdata_ms/21078
Caused by: java.lang.ClassNotFoundException: ___.tmp.hsperfdata_ms.21078
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: ___/tmp/hsperfdata_ms/21078. Program will exit.
Any idea how to fix this or increase the quota?
Thanks!
ms
Your quota has not been exceeded (see command below), but this was a problem with the cluster. It should be fixed now.
$ hadoop fs -dus /user/ms
hdfs://cosmosmaster-gi/user/ms 90731
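(As a side note, -dus is the older form of this check; on newer Hadoop releases the equivalent command, for the same path as above, is:
hadoop fs -du -s /user/ms
)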

How to see my application's exception in hdinsight

How can I see my application's exception in HDInsight?
I created a Hadoop streaming job; when I run the job, it fails with:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 255
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
As far as I know, this is because my code has a bug, throws an exception, and then crashes. How can I get the exception information? Is there an application log or something similar for HDInsight?
RDP to the head node (you will have to enable remote access to the HDInsight cluster) and click the YARN UI shortcut on the desktop. This will show the task logs.
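If you prefer the command line, the same task logs can usually be pulled from a shell on the cluster with the YARN CLI once the job has finished (the application id below is a placeholder; use the one printed when your job was submitted):
yarn logs -applicationId <application_id>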