I have many JSON files inside a folder, all with the same structure. Now I want to create a DataFrame in which each JSON file becomes one row.
I know how to create a DataFrame from a single JSON string, but I don't know how to deal with multiple files:
import spark.implicits._
val jsonStr = """{ "key": 111, "value": 54, "stamp": "aaa"}"""
val df = spark.read.json(Seq(jsonStr).toDS)
Assuming your JSON files are in the folder src/main/resources,
the following code will produce the desired result:
val df: DataFrame = spark.read.json("src/main/resources")
df.show()
+---+-----+-----+
|key|stamp|value|
+---+-----+-----+
|111| aaa| 54|
|111| aaa| 54|
+---+-----+-----+
Note that the JSON must be machine-readable, not human-readable: by default Spark expects one JSON object per line, so the objects should not contain newline characters.
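If your files are pretty-printed instead (one object spread across several lines), a minimal alternative sketch, assuming Spark 2.2+ where the multiLine option is available, is:
// Parse one JSON document per file instead of one per line
val multiLineDf = spark.read
  .option("multiLine", true)
  .json("src/main/resources")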
I have Json file as follows:
{"columns": ['Name', 'City', 'DOB'],
"data":[
['ABC', 'Georgia', '01/05/1987'],
['ABC', 'Kansas', '10/11/1989']]}
How can this file be processed properly using PySpark so that it gets loaded into a table containing name, city and DOB in a proper format? Will I have to first transform it into the usual JSON format and then proceed with a JSON load, or is there another way to handle this?
I tried the approach below and it works. I'm not sure if it's the best way; I'm open to suggestions or improvements.
data = {"columns": ['Name', 'City', 'DOB'],
"data":[
['ABC', 'Georgia', '01/05/1987'],
['ABC', 'Kansas', '10/11/1989']
]
}
# Converting to json string and then reading it into a pandas dataframe
import json
import pandas as pd
json_data = json.dumps(data)
pandasDF = pd.read_json(json_data, orient='split')
pandasDF
# Converting pandas dataframe to spark dataframe as below -
df = spark.createDataFrame(pandasDF)
df.show(truncate=False)
+----+-------+----------+
|Name| City| DOB|
+----+-------+----------+
| ABC|Georgia|01/05/1987|
| ABC| Kansas|10/11/1989|
+----+-------+----------+
I am writing a Spark application in Scala which reads a Hive table and saves the output in HDFS as a JSON-format file.
I read the Hive table using HiveContext, which returns a DataFrame. Below is the code snippet.
val sparkConf = new SparkConf().setAppName("SparkReadHive")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val df = sqlContext.sql(
"""
|SELECT *
|FROM database.table
|""".stripMargin)
df.write.format("json").save(path)
I need the output file to look like this:
[{"name":"tom", "age": 8},
{"name":"Jerry", "age": 7}]
However, what I get looks like this:
{"name":"tom", "age": 8}
{"name":"Jerry", "age": 7}
Can someone please help me with it? Thank you!
We can use .toJSON, .collect() and .mkString to build an array of JSON objects as a single string, then use the Hadoop FileSystem API to create a file in HDFS with the desired format.
Example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
import java.io._
//sample dataframe
val df=sc.parallelize(Seq(("tom",8),("Jerry",7))).toDF("name","age")
//making array of json object
val data=df.toJSON.collect().mkString("[",",\n","]")
//filesystem object
val path = new Path("hdfs://<namenode>:8020/<path>/myfile.txt")
val conf = new Configuration(sc.hadoopConfiguration)
val fs = path.getFileSystem(conf)
if (fs.exists(path))
fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
out.write(data.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
Check the contents of the file in HDFS:
hadoop fs -cat myfile.txt
[{"name":"tom","age":8},
{"name":"Jerry","age":7}]
I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON data can be very large, with more than 1000 keys per row, and a single batch can exceed 20 GB.
The entire batch is generated from 30 data sources; 'key2' of each JSON record identifies the source, and the structure for each source is predefined.
What would be the best approach for processing such data?
I have tried using from_json as shown below, but it works only with a fixed schema, and to use it I first need to group the data by source and then apply the corresponding schema.
Due to the large data volume, my preferred approach is to scan the data only once and extract the required values from each source based on its predefined schema.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import spark.implicits._
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
val df = data.toDF
val schema = (new StructType)
.add("key1", StringType)
.add("key2", StringType)
.add("key3", (new StructType)
.add("key3_k1", StringType))
df.select(from_json($"value",schema).as("json_str"))
.select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
This is just a restatement of #Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
Seq(
("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
The method is available from Spark 2.2.0:
http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader#json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
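Applied to the sample record from the question, a minimal sketch (assuming a SparkSession named spark) looks like this:
// Parse the question's sample record through the Dataset[String] overload;
// the schema is inferred from the strings themselves.
import spark.implicits._
val raw = Seq("""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}""").toDS
val parsed = spark.read.json(raw)
parsed.select("key3.key3_k1").show(false)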
If you have the data as you mentioned in the question:
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
You don't need to create a schema for the JSON data; Spark SQL can infer the schema from the JSON strings. You just have to use sqlContext.read.json as below:
val df = sqlContext.read.json(data)
which will give you the following schema for the RDD data used above:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
You can manipulate the DataFrame as you wish. I hope the answer is helpful.
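If the goal is to scan the raw strings only once and pull just a few fields per source without building a full schema, a hedged sketch using get_json_object on the unparsed string column (the value column produced by data.toDF in the question) could be:
// Extract individual fields straight from the JSON strings;
// paths missing in a given record simply come back as null.
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._
data.toDF("value")
  .select(
    get_json_object($"value", "$.key2").as("source"),
    get_json_object($"value", "$.key3.key3_k1").as("key3_k1"))
  .show(false)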
I am not sure if my suggestion can help you, but I had a similar case and solved it as follows:
1) The idea is to use a JSON library (json rapture here, but any other would do) to load the JSON schema dynamically. For instance, you could read the first row of the JSON file to discover the schema (similarly to what I do here with jsonSchema).
2) Generate the schema dynamically. First iterate through the dynamic fields (notice that I project the values of key3 as Map[String, String]) and add a StructField for each of them to the schema.
3) Apply the generated schema to your DataFrame.
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the following data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
And the output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val3|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
An advanced alternative, which I haven't tested yet, would be to generate a case class, e.g. called JsonRow, from the JSON schema in order to have a strongly typed dataset; this provides better serialization performance apart from making your code more maintainable. To make this work you first need to create a JsonRow.scala file, then implement an sbt pre-build script which dynamically modifies the content of JsonRow.scala (you might have more than one, of course) based on your source files. To generate the class JsonRow dynamically you can use the following code:
def generateClass(members: Map[String, String], name: String): String = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
The generateClass method accepts a map of strings for the class members and the class name itself. You can again populate the members of the generated class from your JSON schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
This prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
Good luck
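A hypothetical usage sketch, assuming the sbt pre-build step has already written the generated JsonRow case class into JsonRow.scala on the classpath:
// Hypothetical: JsonRow is the class generated above, e.g.
// case class JsonRow(key3_k2: String, key3_k1: String, key1: String, key2: String, key3_k3: String)
import spark.implicits._
val typedDs = jsonDF
  .select($"key1", $"key2", $"key3.key3_k1", $"key3.key3_k2", $"key3.key3_k3")
  .as[JsonRow] // strongly typed Dataset[JsonRow] instead of an untyped DataFrame
typedDs.show(false)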
I have a DataFrame df which is the result of some pre-processing. The size of df is around 10,000 rows.
I save this DataFrame in CSV as follows:
df.coalesce(1).write.option("sep",";").option("header","true").csv("output/path")
Now I want to save this DataFrame as a text file in which each row is a JSON string, with the column names becoming the attribute names in the JSON strings.
For example:
df =
col1 col2 col3
aa 34 55
bb 13 77
json_txt =
{"col1": "aa", "col2": "34", "col3": "55"}
{"col1": "bb", "col2": "13", "col3": "77"}
What is the best way to do it?
You can use the write.json API to save a DataFrame in JSON format:
df.coalesce(1).write.json("output path of json file")
The above code creates JSON output files. But if you want a plain text format (JSON text), you can use the toJSON API:
df.toJSON.rdd.coalesce(1).saveAsTextFile("output path to text file")
I hope the answer is helpful.
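A quick sketch to check what toJSON emits for the example frame (note that non-string columns would appear unquoted, so the sample uses string values as in the question):
// toJSON turns each row into a JSON string, one line per row.
import spark.implicits._
val example = Seq(("aa", "34", "55"), ("bb", "13", "77")).toDF("col1", "col2", "col3")
example.toJSON.collect().foreach(println)
// {"col1":"aa","col2":"34","col3":"55"}
// {"col1":"bb","col2":"13","col3":"77"}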
I imported data from BigQuery using the following code in PySpark:
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
The output is an RDD, but the data is in JSON format:
[(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]
I need to extract all the values into an RDD. The main concern is that the resulting RDD should not contain double quotes around each record.
Required:
Value1,Value4
Value2
and not:
"Value1,Value4"
"Value2"
From what I understood from your question this is what you are looking for:
import json
data = sc.parallelize([(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')])
data = data.map(lambda x: (json.loads(x[1])['colA']))
print(data.collect())
Results:
['Value1,Value4', 'Value2']
Alternatively, load it with the json module:
import json
table_data.map(lambda t: json.loads(t[1]).get("colA")).collect()
# [u'Value1,Value4', u'Value2']