I am upgrading from Spark 1.6 to Spark 2 and am having an issue reading in CSV files. In Spark 1.6 I would use something like this to read in a CSV file:
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(fileName)
Now I use the following code as given in the documentation:
val df = spark.read
  .option("header", "true")
  .csv(fileName)
This results in the following error when running:
"Exception in thread "main" java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name."
I assume this is because I still had the spark-csv dependency; however, I removed that dependency and rebuilt the application, and I still get the same error. How is the Databricks dependency still being found once I have removed it?
The error message means you are passing the --packages com.databricks:spark-csv_2.11:1.5.0 option when you run spark-shell, or you still have those jars on your classpath. Please check your classpath and remove them.
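If it helps, here is a minimal sketch of what the sbt build can look like once spark-csv is removed (the Spark and Scala versions here are illustrative, not taken from the question); on Spark 2.x the CSV reader ships with spark-sql, so no extra artifact is needed:

// build.sbt -- sketch for a Spark 2.x job; the old "com.databricks" %% "spark-csv" line is simply dropped
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"

As a stop-gap, if a stale spark-csv jar cannot be removed from the runtime classpath, the error message itself points at a workaround: disambiguate the source with its fully qualified class name, for example:

val df = spark.read
  .option("header", "true")
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .load(fileName)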
I didn't add any jars to my class path.
I use this to load a CSV file in spark-shell (2.3.1):
val df = spark.sqlContext.read.csv("path")
I have written a small custom producer in Kafka using Scala and it is giving the error below. I have attached some code for reference.
Name: Compile Error
Message: <console>:61: error: not found: type KafkaProducer
val producer = new KafkaProducer[String, String](props)
^
I think I need to import the relevant package. I tried importing several packages but could not find the correct one.
val producer = new KafkaProducer[String, String](props)
for (i <- 1 to 10) {
  // producer.send(new ProducerRecord[String, String]("jin", "test", "test"))
  val record = new ProducerRecord[String, String]("jin", "key", "the end ")
  producer.send(record)
}
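For completeness, the snippet above references a props object that is not shown in the question; below is a minimal sketch of what it typically contains (the broker address is an assumption). Once the kafka-clients dependency is on the classpath, KafkaProducer and ProducerRecord come from org.apache.kafka.clients.producer:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical producer configuration; adjust bootstrap.servers to your cluster.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")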
I can't install a Scala kernel for Jupyter right now, but based on this GitHub notebook you should add Kafka as a dependency; then the library might be recognized:
%%configure -f
{
  "conf": {
    "spark.jars.packages": "org.apache.spark:spark-streaming_2.11:2.1.0,org.apache.bahir:spark-streaming-twitter_2.11:2.1.0,org.apache.spark:spark-streaming-kafka-0-8_2.10:2.1.0,com.google.code.gson:gson:2.4",
    "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11"
  }
}
If this doesn't work, try downloading the whole notebook from the git and running it yourself to see if something else is needed.
@Arthur, the magic command %%configure -f did not work in the Jupyter notebook. I tried downloading the whole notebook from the git, but that did not work either. Luckily, while reading the Apache Toree documentation on adding dependencies, I found the %AddDeps command. After putting the dependencies into the Jupyter notebook in the format below, I managed to run the code.
%AddDeps org.apache.kafka kafka-clients 1.0.0
%AddDeps org.apache.spark spark-core_2.11 2.3.0
Just for the information of others: when we compile the code using sbt, we need to comment out these lines in the Jupyter notebook, since we add the dependencies to the build.sbt file instead (sketched below). Thanks Arthur for showing the direction!
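For anyone compiling with sbt rather than running inside the notebook, here is a sketch of the equivalent build.sbt entries; the versions are copied from the %AddDeps lines above, while the Scala patch version is an assumption:

// build.sbt -- sketch mirroring the %AddDeps lines above
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.kafka"  % "kafka-clients" % "1.0.0",
  "org.apache.spark" %% "spark-core"    % "2.3.0"
)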
I am trying to follow a tutorial in SparkR. I followed the setup as required, but as soon as I try the function read.json(path) I get the following error:
"Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)..."
I am running R 3.3.2 and Java JDK 1.8 as requested in the tutorial.
I attach images of the code and the results; the screenshot is from RStudio, with the code on the left and the console output on the right. Is my Java installation being found, and is it the right version?
Solution:
Make sure a spark-submit or sparkR instance is available. Then put the JSON file on Hadoop HDFS using an hdfs://... path:
hadoop-2.0.2\bin> hadoop fs -put "/example/../people.json" "/user/../people.json"
Then use
people <- read.df(sqlContext, "/user/../people.json", "json")
to read the JSON and create the DataFrame 'people'.
The above steps worked for me after I made the necessary changes in the example dataframe.R.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import play.api.libs.json._
import java.util.Date
import javax.xml.bind.DatatypeConverter

object Test {
  def main(args: Array[String]): Unit = {
    val logFile = "test.txt"
    val conf = new SparkConf().setAppName("Json Test")
    val sc = new SparkContext(conf)
    try {
      val out = "output/test"
      // cleanTypo is a helper defined elsewhere in the application
      val logData = sc.textFile(logFile, 2).map(line => Json.parse(cleanTypo(line))).cache()
    } finally {
      sc.stop()
    }
  }
}
Since the Spark/Jackson conflict problem was mentioned, I have rebuilt Spark using
mvn versions:use-latest-versions -Dincludes=org.codehaus.jackson:jackson-core-asl
mvn versions:use-latest-versions -Dincludes=org.codehaus.jackson:jackson-mapper-asl
So the jars have been updated to 1.9.x
But I still have the error
15/03/02 03:12:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
at org.codehaus.jackson.map.introspect.JacksonAnnotationIntrospector.findDeserializationType(JacksonAnnotationIntrospector.java:524)
at org.codehaus.jackson.map.deser.BasicDeserializerFactory.modifyTypeByAnnotation(BasicDeserializerFactory.java:732)
at org.codehaus.jackson.map.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:427)
at org.codehaus.jackson.map.deser.StdDeserializerProvider._createDeserializer(StdDeserializerProvider.java:398)
at org.codehaus.jackson.map.deser.StdDeserializerProvider._createAndCache2(StdDeserializerProvider.java:307)
at org.codehaus.jackson.map.deser.StdDeserializerProvider._createAndCacheValueDeserializer(StdDeserializerProvider.java:287)
at org.codehaus.jackson.map.deser.StdDeserializerProvider.findValueDeserializer(StdDeserializerProvider.java:136)
at org.codehaus.jackson.map.deser.StdDeserializerProvider.findTypedValueDeserializer(StdDeserializerProvider.java:157)
at org.codehaus.jackson.map.ObjectMapper._findRootDeserializer(ObjectMapper.java:2468)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2383)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1094)
at play.api.libs.json.JacksonJson$.parseJsValue(JsValue.scala:477)
at play.api.libs.json.Json$.parse(Json.scala:16)
We hit almost exactly the same issue. We were trying to use 1.9.2 but hit a NoSuchMethodError as well.
Annoyingly, there is not just one version conflict to deal with, but two. First of all, Spark depends on Hadoop (for HDFS), which depends on a 1.8.x build of Jackson (org.codehaus.jackson), and this is the conflict you are seeing. Spark (at least 1.2+) then uses the Jackson 2.4.4 core, which was moved to com.fasterxml.jackson.core, so it does not actually conflict with 1.8.x thanks to the different package names.
So in your case your code should work if you do one of three things:
upgrade to a 2.4.x build that is less than or equal to 2.4.4, since the actual dependency will be replaced by Spark's, which is 2.4.4 (at the time of writing)
downgrade to a 1.8.x build that is less than or equal to the 1.8.x build Hadoop is using
compile Spark against your 1.9.x build. I know you mention this and that it didn't work, but when we tried it, it was successful; we ran the build with the option -Dcodehaus.jackson.version=1.9.2
Unfortunately, there are going to be a lot more issues like this, due to the nature of Spark: it already has all of its own internal dependencies on the classpath, so any conflicting job dependencies will never work out. Spark already does some dependency shading to avoid this issue with packages like Guava, but this is not currently done with Jackson.
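One way to work around this from the job side, rather than rebuilding Spark, is to shade the org.codehaus.jackson classes inside your own assembled jar so they cannot collide with the copy that Hadoop puts on the executor classpath. A sketch using the sbt-assembly plugin; the renamed package prefix and the plugin version are illustrative choices, not something Spark prescribes:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- rename the codehaus Jackson classes bundled into the fat jar
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.codehaus.jackson.**" -> "shaded.codehaus.jackson.@1").inAll
)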
I am attempting to use the Eclipse Scala IDE for 2.11 (I downloaded the prepackaged bundle from the website). I have been using the Scala Worksheet to work with a SaaS API returning JSON, pushing through with just String methods so far. I decided to begin using json4s. I went to http://mvnrepository.com/ and obtained the following libraries:
json4s-core-2.11-3.2.10.jar
json4s-native-2.11-3.2.10.jar
paranamer-2.6.jar
I have added all three jars to the project's Build Path, and they appear under the project's "Referenced Libraries".
I have the following code in a Scala Worksheet:
package org.public_domain

import org.json4s._
import org.json4s.native.JsonMethods._

object WorkSheet6 {
  println("Welcome to the Scala worksheet")

  parse(""" { "numbers" : [1, 2, 3, 4] } """)

  println("Bye")
}
I am receiving the following two compilation errors:
bad symbolic reference to org.json4s.JsonAST.JValue encountered in class file 'package.class'. Cannot access type JValue in value org.json4s.JsonAST. The current classpath may be missing a definition for org.json4s.JsonAST.JValue, or package.class may have been compiled against a version that's incompatible with the one found on the current classpath.
bad symbolic reference to org.json4s.JsonAST.JValue encountered in class file 'package.class'. Cannot access type JValue in value org.json4s.JsonAST. The current classpath may be missing a definition for org.json4s.JsonAST.JValue, or package.class may have been compiled against a version that's incompatible with the one found on the current classpath.
When I look inside the org.json4s package in the json4s-core-2.11-3.2.10.jar file, there is in fact no .class file for any compiled JsonAST object.
This is a showstopper. Any help on this would be greatly appreciated.
Your classpath is incomplete. You are missing a dependency of json4s-core.
bad symbolic reference to org.json4s.JsonAST.JValue encountered in class file 'package.class'. Cannot access type JValue in value org.json4s.JsonAST. The current classpath may be missing a definition for org.json4s.JsonAST.JValue, or package.class may have been compiled against a version that's incompatible with the one found on the current classpath.
The simplest way to consume Scala or Java libraries is to use sbt or Maven; they bring in (transitive) dependencies for you. If you check the pom definition of the json4s-core library, you will notice it depends on json4s-ast. You should add that jar to your build path as well.
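For example, here is a minimal sbt sketch for the versions listed in the question (the Scala patch version is an assumption); sbt pulls in json4s-core, json4s-ast, and paranamer transitively:

// build.sbt -- json4s-native drags in json4s-core and json4s-ast for you
scalaVersion := "2.11.7"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.10"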
I am trying to work with jerkson in Play with Scala 2.10.
I want to load data fixtures based on JSON files; for this procedure I'm trying to load the JSON with the "parse" command from jerkson.
That ultimately fails.
I'm doing this in the override def onStart(app: Application) function. The error:
NoClassDefFoundError: Could not initialize class com.codahale.jerkson.Json$
Any guesses why this is happening? I have the following libs in my dependencies:
"com.codahale" % "jerkson_2.9.1" % "0.5.0",
"com.cloudphysics" % "jerkson_2.10" % "0.6.3"
My parsing command is:
com.codahale.jerkson.Json.parse[Map[String,Any]](json)
Thanks in advance
A NoClassDefFoundError generally means there is some sort of issue with the classpath. For starters, if you are running on Scala 2.10, I would remove the following line from your sbt file:
"com.codahale" % "jerkson_2.9.1" % "0.5.0"
Then make sure the com.cloudphysics jerkson jar file is available on your app's classpath and try your test again.
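A sketch of what the trimmed dependency list might look like, keeping only the Scala 2.10 build so that a single, binary-compatible jerkson ends up on the classpath:

// build.sbt (or the Build.scala dependency list) -- only the 2.10 artifact remains
libraryDependencies ++= Seq(
  "com.cloudphysics" % "jerkson_2.10" % "0.6.3"
)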