How to import a package from a local jar in pyspark? - csv

I am using pyspark to do some work on a csv file, hence I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
It gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help me sort this out? Thank you very much; any clue is appreciated.

The correct way to do this would be to add the options (say, if you are starting a spark shell):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the databricks csv jar directly, but I used a Netezza connector for Spark where they mention using this option:
https://github.com/SparkTC/spark-netezza
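For pyspark the same options apply at launch time. One thing worth noting: the NoClassDefFoundError on org/apache/commons/csv/CSVFormat points at spark-csv's own commons-csv dependency, which --packages resolves automatically but a single local jar on --jars does not, so the dependency jar has to be supplied as well. Below is a rough sketch of that idea from a plain Python script; the commons-csv jar name and version are assumptions, and PYSPARK_SUBMIT_ARGS is expected to end with pyspark-shell:

import os

# Assumed local jar paths: spark-csv plus the commons-csv jar it depends on.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar,"
    "/home/rx52019/data/commons-csv-1.1.jar pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-local-jar")
sqlContext = SQLContext(sc)

# Same read as in the question, against the file already uploaded to HDFS.
df = (sqlContext.read.format("com.databricks.spark.csv")
      .options(header="true", inferschema="true")
      .load("hdfs://dev-icg/user/spark/routes.dat"))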

Related

Databricks csv cannot find local file

In a program I have a csv extracted from Excel, and I need to upload the csv to HDFS and save it in Parquet format; the Python or Spark version doesn't matter, but no Scala, please.
Almost all discussions I came across are about Databricks; however, it seems it cannot find the file. Here is the code and the error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv
does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.
I found the issue now!
The reason it errors out with file not found is actually correct behavior: I was creating the SparkContext with setMaster("yarn-cluster"), which means all worker nodes will look for the csv file, and of course none of the worker nodes (except the one where the program starts and the csv resides) have this file, hence the error. What I really should do is use setMaster("local").
FIX:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load(csv)
Yes, you are right: the file should be present on all worker nodes.
Well, you can still read a local file in yarn cluster mode; you just need to add your file using addFile.
spark.sparkContext.addFile("file:///your local file path ")
Spark will copy the file to each node where an executor is created, so it can process your file in cluster mode as well.
I am using Spark 2.3, so you may need to adjust your Spark context accordingly, but the addFile method remains the same.
Try this with your YARN cluster mode and let me know if it works for you.
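A rough, untested sketch of that suggestion in pyspark (Spark 2.x style; the csv path is copied from the question above, and the Parquet output path is hypothetical):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-addfile").getOrCreate()

# Ship the driver-local csv to every node that runs an executor.
spark.sparkContext.addFile("file:///home/rxie/csv_out/wamp.csv")

# Read it back through the copy that addFile distributed.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file://" + SparkFiles.get("wamp.csv")))

# Hypothetical output location for the Parquet step mentioned in the question.
df.write.parquet("hdfs:///user/rxie/wamp_parquet")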

Can't import file on Neo4j with RuntimeException error

I have updated the neo4j.conf file but can't seem to get rid of this error after changing the file and restarting. I am just trying to load a JSON file through Neo4j and have included the line apoc.import.file.enabled=true in neo4j.conf, but it doesn't seem to be working for me; I'm still getting the error message:
Failed to invoke procedure 'apoc.load.json' Caused by
java.lang.RuntimeException : Import from files not enabled, please set
apoc.import.file.enabled=true in your neo4j.conf
I am using Neo4j CE 3.2.3 and have used the right file path for the JSON file, as it previously worked on my desktop computer (I'm just trying to replicate it on my laptop), and I am using the apoc 3.2.0.4 plugin. The procedure apoc.load.json also shows up when I list all procedures.

Adding Spark CSV dependency to Zeppelin

I'm running an EMR cluster with Spark on AWS.
The Spark version is 1.6.
When running the following command:
proxy = sqlContext.read.load("/user/zeppelin/ProxyRaw.csv",
format="com.databricks.spark.csv",
header="true",
inferSchema="true")
I get the following error:
Py4JJavaError: An error occurred while calling o162.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at
http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
How can I solve this? I assume I should add a package but how do I install it and where?
There are many ways to add packages in Zeppelin:
One of them is to change the conf/zeppelin-env.sh configuration file, adding the package you need (e.g. com.databricks:spark-csv_2.10:1.4.0 in your case) to the submit options, since Zeppelin uses the spark-submit command under the hood:
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.4.0"
But let's say that you don't actually have access to that configuration. You can then use dynamic dependency loading via the %dep interpreter (deprecated):
%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")
This will require that you load the dependencies before launching or restarting the interpreter.
Another way to do it is to add the dependency you need via the interpreter dependency manager, as described in the following link: Dependency Management for Interpreter.
Well,
First you need to download the CSV lib from the Maven repository:
https://mvnrepository.com/artifact/com.databricks/spark-csv_2.10/1.5.0
Check the Scala version that you are using: whether it is 2.10 or 2.11.
When you call spark-shell, spark-submit, or pyspark (or even your Zeppelin notebook), you need to add the --jars option with the path to your lib.
Like this:
pyspark --jars /path/to/jar/spark-csv_2.10-1.5.0.jar
Then you can call it as you did above.
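For instance, once pyspark (or the Zeppelin Spark interpreter) is started with that --jars option, the load from the question should work as written (Spark 1.6 API, path and options copied from the question):

proxy = sqlContext.read.load("/user/zeppelin/ProxyRaw.csv",
                             format="com.databricks.spark.csv",
                             header="true",
                             inferSchema="true")
proxy.printSchema()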
You can see another closely related issue here: How to add third party java jars for use in pyspark

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

I am using a shared node cluster running Hadoop 2.5.0-cdh5.3.2.
Please share the names of all compatible versions of MySQL jar files to be loaded, and the folder paths, for a successful import and export between HDFS and MySQL.
I am currently getting the error message below:
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
So far, I have loaded the MySQL connector and Hadoop jar files into /usr/lib/sqoop/lib:
mysql-connector-java-5.0.8-bin
hadoop-core-1.0.3
Please let me know if I need to add more files, and specify the path.
If it is a version incompatibility issue, then you can give mysql-connector-java-5.1.31.jar a try, as I am using it with Sqoop version 1.4.5. For me, it works for both data import and export use cases.

How to use Spark DataFrame with MySQL

OK, I know that I can use the jdbc connector to create a DataFrame with this command:
val jdbcDF = sqlContext.load("jdbc",
Map("url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
"dbtable" -> "video"))
But I got this error: java.sql.SQLException: No suitable driver found for ...
And I have tried to add the jdbc jar to the Spark classpath with both of these commands, but failed:
spark-shell --jars mysql-connector-java-5.0.8-bin.jar
SPARK_CLASSPATH=mysql-connector-java-5.0.8-bin.jar spark-shell
My Spark version is 1.3.0, and Class.forName("com.mysql.jdbc.Driver").newInstance works.
It is caused by the DataFrame not finding the MySQL connector jar in the classpath. This can be resolved by adding the jar to the Spark classpath as below:
Edit /spark/bin/compute-classpath.sh as:
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR:yourPathToJar/mysql-connector-java-5.0.8-bin.jar"
Save the file and restart Spark.
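In pyspark terms, a minimal sketch of the same load once the connector jar is on the classpath (Spark 1.3-era API; the URL, credentials, and table name are copied from the question):

# Assumes the MySQL connector jar has been added to the Spark classpath as described above.
jdbc_df = sqlContext.load(
    source="jdbc",
    url="jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
    dbtable="video")
jdbc_df.printSchema()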
You might want to try mysql-connector-java-5.1.29-bin.jar