I just downloaded Spark 2.2 from the website, and created a simple project with the example from here.
The code is this:
import java.util.Properties
import org.apache.spark
object MysqlTest {
def main(args: Array[String]) {
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost/hap")
.option("dbtable", "hap.users")
.option("user", "***")
.option("password", "***")
.load()
}
}
The problem is that apparently spark.read does not exist.
I guess the Spark API's documentation is not up to date and the examples do not work. I would appreciate a working example.
I think you need this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Yo bro")
.getOrCreate()
The docs should be correct, but you skipped the part where the initialization is explained: https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession
The convention in the Spark docs is that spark is a SparkSession instance, so it needs to be created first. You do this with SparkSession.builder().
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
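For comparison, here is a minimal PySpark sketch of the same idea. The helper name read_jdbc_table and the connection values are placeholders for illustration, and it assumes a MySQL JDBC driver is available to Spark. It only shows that spark has to be an existing SparkSession before spark.read can be used.

```python
def read_jdbc_table(spark, url, table, user, password):
    """Load one table over JDBC using an already-created SparkSession."""
    return (
        spark.read
        .format("jdbc")
        .option("url", url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .load()
    )


def main():
    # Imported here so the helper above can be defined without Spark installed.
    from pyspark.sql import SparkSession

    # The session must be created first; only then does spark.read exist.
    spark = SparkSession.builder.appName("MysqlTest").getOrCreate()
    df = read_jdbc_table(spark, "jdbc:mysql://localhost/hap", "hap.users", "***", "***")
    df.printSchema()
    spark.stop()
```

Calling main() requires a local Spark installation and a reachable database; the helper itself only depends on the object passed in as spark.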
I am trying a POC on Kafka where I load a dataset into a topic and read from it. I am trying to create a struct as follows to apply to the data that I will read from the Kafka topic:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
//import org.apache.spark.sql.types.DataType.j
//import org.json4s._
//import org.json4s.na
val df = spark
.read
.format("kafka")
.options(Admin.commonOptions)
.option("subscribe", topic)
.load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
println("personStringDF--")
personStringDF.show()
personStringDF.printSchema()
val schemaTopic: StructType = StructType(
Array(
StructField(name = "col1", dataType = StringType, nullable = false),
StructField(name = "col2", dataType = StringType, nullable = false)
))
My BUILD file :
java_library(
name = "spark",
exports = [
"#maven//:org_apache_spark_spark_core_2_12",
"#maven//:org_apache_spark_spark_sql_2_12",
"#maven//:org_apache_spark_spark_unsafe_2_12",
"#maven//:org_apache_spark_spark_tags_2_12",
"#maven//:org_apache_spark_spark_catalyst_2_12",
"#maven//:com_fasterxml_jackson_core_jackson_annotations",
"#maven//:com_fasterxml_jackson_core_jackson_core",
"#maven//:com_fasterxml_jackson_core_jackson_databind",
"#maven//:com_typesafe_play_play_json_2_12_2_9_1",
"#maven//:org_json4s_json4s_ast_2_12_4_0_0",
"#maven//:org_json4s_json4s_jackson_2_12_4_0_0",
],
)
But I am getting: Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue
Can anybody help? I am not sure why I am getting this.
(I am running this code with Bazel. I also have a WORKSPACE file in which all these dependencies are listed. This is a runtime error; the Bazel build succeeds.)
This issue was resolved by downgrading the json4s dependencies to 3.6.6 in the WORKSPACE file:
"org.json4s:json4s-ast_2.12:3.6.6",
"org.json4s:json4s-core_2.12:3.6.6",
"org.json4s:json4s-jackson_2.12:3.6.6",
"org.json4s:json4s-scalap_2.12:3.6.6",
You missed some dependent json4s libraries.
Bazel requires you to explicitly enumerate all needed dependencies in the BUILD file.
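A quick way to catch this before running is to check the list of Maven coordinates for missing json4s modules or mixed versions. The sketch below is only an illustration: the function name check_json4s is made up, and the list of required modules is taken from the four artifacts named in the answer above, which may not be the exact set every Spark build needs.

```python
# Modules named in the answer above; treat this list as an assumption.
REQUIRED_JSON4S_MODULES = ("ast", "core", "jackson", "scalap")


def check_json4s(coordinates, scala_suffix="2.12"):
    """Return (missing_modules, versions_found) for json4s Maven coordinates."""
    found = {}
    prefix = "json4s-"
    suffix = "_" + scala_suffix
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3:
            continue  # not a group:artifact:version coordinate
        group, artifact, version = parts
        if group != "org.json4s":
            continue
        if artifact.startswith(prefix) and artifact.endswith(suffix):
            module = artifact[len(prefix):-len(suffix)]
            found[module] = version
    missing = [m for m in REQUIRED_JSON4S_MODULES if m not in found]
    versions = sorted(set(found.values()))
    return missing, versions


if __name__ == "__main__":
    deps = [
        "org.json4s:json4s-ast_2.12:4.0.0",
        "org.json4s:json4s-jackson_2.12:4.0.0",
    ]
    print(check_json4s(deps))
```

An empty first element and a single version in the second element suggest the json4s entries are at least complete and consistent with each other.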
I have a big JSON document that I want to use in Spark Structured Streaming. I don't want to retype this JSON manually as a Spark schema expression. Can I generate the schema automatically once?
I wrote this
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Infer Schema") \
.getOrCreate()
df = spark \
.read \
.option("multiline", True) \
.json("file_examples/dataflow/row01.json")
df.printSchema()
df.show()
with open("dataflow_schema.json", "w") as fp:
fp.write(df.schema.json())
Is this ok?
You are on the right path. You can save your schema as JSON and load it later. Be sure to parse the saved string into a dictionary and then convert that to a StructType before use:
import json
from pyspark.sql.types import StructType
with open("dataflow_schema.json", "r") as fp:
json_schema_str = fp.read()
my_schema = StructType.fromJson(json.loads(json_schema_str))
In your Structured Streaming query, if you have a JSON column, you can use the from_json function to convert the JSON to a struct type and then expand it into several columns, e.g.:
from pyspark.sql.functions import from_json,col
# Assume that we have a kafkaStream
kafkaStream.selectExpr("CAST(value as string)")\
.select(from_json(col("value"),my_schema).alias("json_value"))\
.selectExpr("json_value.*") # extract as columns
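If you want to sanity-check the saved schema file without starting Spark, you can walk the JSON directly. This is a rough sketch: the function name field_paths and the sample schema are made up for illustration, and it assumes the usual layout written by df.schema.json() (a "struct" with a "fields" list, nested structs and arrays as dictionaries).

```python
import json


def field_paths(schema, prefix=""):
    """Yield dotted paths of all fields in a StructType JSON dictionary."""
    for field in schema["fields"]:
        path = prefix + field["name"]
        yield path
        dtype = field["type"]
        # Descend through arrays to reach their element type.
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        # Recurse into nested structs.
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            yield from field_paths(dtype, path + ".")


if __name__ == "__main__":
    # Hypothetical schema, shaped like the content of dataflow_schema.json.
    sample = json.loads("""
    {"type": "struct", "fields": [
      {"name": "id", "type": "long", "nullable": true, "metadata": {}},
      {"name": "user", "nullable": true, "metadata": {},
       "type": {"type": "struct", "fields": [
         {"name": "name", "type": "string", "nullable": true, "metadata": {}}
       ]}}
    ]}
    """)
    for path in field_paths(sample):
        print(path)
```

Listing the paths this way makes it easy to confirm that the inferred schema contains the columns you expect before wiring it into from_json.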
I know 2 ways to import a CSV file in PySpark:
1) I can use SparkSession. Here is my full code in Jupyter Notebook.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Spark Session 1').getOrCreate()
df = spark.read.csv('mtcars.csv', header = True)
2) I can use the Spark-CSV module from Databricks.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header = 'true', inferschema = 'true').load('mtcars.csv')
1) What are the advantages of SparkSession over Spark-CSV?
2) What are the advantages of Spark-CSV over SparkSession?
3) If SparkSession is perfectly capable of importing CSV files, why did Databricks invent the Spark-CSV module?
Let me answer the third question first: since 2.0.0, CSV support is built into Spark. In older versions of Spark we had to use the spark-csv library, which Databricks created at an early stage (1.3+).
To address your first and second questions:
it is essentially a Spark 1.6 vs. 2.0+ comparison. If you use SparkSession, you get all the features provided by spark-csv plus the Spark 2.0 features. If you use spark-csv, you lose those features.
Hope this helps.
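To make the version split concrete, here is a small sketch that picks the CSV data source name from the Spark version string. The function names csv_source_for and read_csv are made up for illustration; reader is assumed to be spark.read on 2.0+ or sqlContext.read on 1.x (with the spark-csv package installed).

```python
def csv_source_for(spark_version):
    """Return the data source name to pass to .format() for CSV files."""
    major = int(spark_version.split(".")[0])
    # Spark 2.0+ ships a built-in "csv" source; 1.x needs the external package.
    return "csv" if major >= 2 else "com.databricks.spark.csv"


def read_csv(reader, spark_version, path):
    """Read a CSV file with a header row, inferring column types."""
    return (
        reader
        .format(csv_source_for(spark_version))
        .options(header="true", inferschema="true")
        .load(path)
    )
```

With a SparkSession this would be called as read_csv(spark.read, spark.version, 'mtcars.csv').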
I am trying to use SparkSession to convert JSON data of a file to RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
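Since getAll() returns a plain list of (key, value) tuples, it is easy to narrow it down in ordinary Python. A minimal sketch (the helper name conf_with_prefix and the sample values are made up for illustration):

```python
def conf_with_prefix(pairs, prefix):
    """Keep only the (key, value) pairs whose key starts with the given prefix."""
    return {key: value for key, value in pairs if key.startswith(prefix)}


if __name__ == "__main__":
    # Stand-in for the result of spark.sparkContext.getConf().getAll()
    pairs = [
        ("spark.eventLog.enabled", "true"),
        ("spark.app.name", "test"),
        ("spark.yarn.jars", "local:/<yourpath>/lib/spark2/jars/*"),
    ]
    print(conf_with_prefix(pairs, "spark.yarn."))
```

In a live session you would pass spark.sparkContext.getConf().getAll() as the first argument.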
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
sc = spark.sparkContext
SparkContext
spark_conf = (
SparkConf()
.setAppName("Your App Name")
.set("spark.some.config.option1", "some-value")
.set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf = spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
path/to/your/pyspark_main.py
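If you would rather pass the same settings on the command line, each entry can also be given as a --conf key=value argument to spark-submit. The sketch below reads a simple properties file of the "key value" form shown above and builds those arguments. It is a simplified parser written for illustration (the function names are made up): it only handles whitespace-separated pairs and # comments, and keeps values verbatim.

```python
def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        conf[key] = value
    return conf


def to_conf_args(conf):
    """Turn a dict of settings into a list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", "{}={}".format(key, value)])
    return args


if __name__ == "__main__":
    sample = "spark.some.config.option1 some-value\n"
    print(to_conf_args(parse_spark_defaults(sample)))
```

The resulting list can be placed between spark-submit and the path of the main script.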
This is how it worked for me to add Spark or Hive settings in my Scala code:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
.getOrCreate()
}
The easiest way to set some config:
spark.conf.set("spark.sql.shuffle.partitions", 500)
Here spark refers to a SparkSession; this way you can set configs at runtime. It is really useful when you want to change configs repeatedly to tune Spark parameters for specific queries.
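When tuning per query, it can help to restore the old value afterwards. A small PySpark-style sketch of that pattern follows; the helper name temporary_conf is made up, and it assumes the key is one that Spark allows to be changed at runtime (static configs cannot be set this way).

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(spark, key, value):
    """Set a runtime config for the duration of a with-block, then restore it."""
    previous = spark.conf.get(key, None)  # None if the key has no value
    spark.conf.set(key, str(value))
    try:
        yield spark
    finally:
        if previous is None:
            spark.conf.unset(key)
        else:
            spark.conf.set(key, previous)
```

Usage would look like: with temporary_conf(spark, "spark.sql.shuffle.partitions", 500): followed by the query you want to run with that setting.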
In simple terms, values set with the config method are automatically propagated to both SparkConf and SparkSession's own configuration.
For example, see https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn about this API, see: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run-time as in your example above or through the config file given to spark-submit
Using pyspark:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("spark play")\
.getOrCreate()
df = spark.read\
.format("jdbc")\
.option("url", "jdbc:mysql://localhost:port")\
.option("dbtable", "schema.tablename")\
.option("user", "username")\
.option("password", "password")\
.load()
Rather than fetch "schema.tablename", I would prefer to grab the result set of a query.
As in 1.x, you can pass a valid subquery as the dbtable argument, for example:
...
.option("dbtable", "(SELECT foo, bar FROM schema.tablename) AS tmp")
...
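If you build the query elsewhere, a tiny helper can wrap it in the parenthesized, aliased form that dbtable expects. This is just a sketch; the name as_dbtable and the default alias are made up for illustration.

```python
def as_dbtable(query, alias="tmp"):
    """Wrap a SELECT statement so it can be passed as the JDBC dbtable option."""
    # A trailing semicolon would be invalid inside the parentheses.
    cleaned = query.strip().rstrip(";").strip()
    return "({}) AS {}".format(cleaned, alias)


if __name__ == "__main__":
    print(as_dbtable("SELECT foo, bar FROM schema.tablename;"))
```

The result would then be used as .option("dbtable", as_dbtable("SELECT foo, bar FROM schema.tablename")) in the reader chain above.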