How to map Spark Dataframe rows to a case class array?

Let's say I have a dataframe like this:
var df = sqlContext.createDataFrame(Seq((1,"James"),(2,"Anna"))).toDF("id", "name")
My goal is to write it out as valid JSON. But if I write this to JSON, I get a file with a single JSON object on each line, which is not a valid JSON document.
Therefore I would like a structure like this:
root
|-- Result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = false)
If I create a case class for a row:
case class ResultRow(id: String, name: String)
I can use it to map the initial dataframe to its objects.
Or if I use df.collect(), I get an array of Rows. Then I can create an object of this class:
case class Result(var Result: Array[ResultRow])
Then I can loop over the collected Row objects, map them to ResultRow objects, add them to the array in a Result instance, transform that object into a new dataframe and write it to JSON.
But the problem with this approach is that collect() gathers all the data to the driver, and with hundreds of thousands of rows the job would take hours.
I want to get a dataframe with a single Array-typed column and a single row, where that array contains all the rows of the initial dataframe as its elements, and I want to do this without using collect().
Any help is appreciated
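For what it's worth, a minimal sketch of one way to get that shape without collect(), assuming Spark 2.0+ (where collect_list accepts struct-typed columns); the output path is hypothetical:
import org.apache.spark.sql.functions.{collect_list, struct}

// Wrap each row into a struct and aggregate all of them into one array column,
// producing a single row with a "Result" array. This is a distributed aggregation,
// so nothing is collected through the driver.
val wrapped = df.agg(collect_list(struct(df("id"), df("name"))).as("Result"))
wrapped.write.json("result_json")   // hypothetical output path
Note that all rows still end up in one array in a single output row, so this does not scale indefinitely, but it avoids pulling the data to the driver with collect().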

Related

Avro schema - nested Struct vs Array

I'm reading a nested Avro file from PySpark and trying to understand the schema. My question is: what is the difference between the following two types of data? I commented in the schema below. I could not quite visualize what each type of data looks like. Could someone please help explain? Thanks!
--products: array   # This seems to be an array of strings like [100, 'com']?
| |-- element: struct
| | |-- productNumber: string
| | |-- productType: string
# Is this a list of strings as well?
# What's the difference between this one and having the schema like the one above?
--productID: struct
| |-- IDType: string
| |-- description: string
| |-- classification: string
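To illustrate the difference (with made-up values): an array of structs holds zero or more records per row, while a plain struct holds exactly one group of fields per row, so neither is an array of strings. A single row of each might look like:
products  -> [{"productNumber": "100", "productType": "com"},
              {"productNumber": "200", "productType": "net"}]
productID -> {"IDType": "UPC", "description": "a product", "classification": "retail"}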

pyspark add new nested array to existing json file

I'm new to Spark and have a big problem which I can't solve, even after hours of searching...
I have a jsonFile which looks like this:
root
|-- dialogueData: struct (nullable = true)
| |-- dialogueID: string (nullable = true)
| |-- dialogueLength: double (nullable = true)
| |-- speakerChanges: long (nullable = true)
|-- snippetlist: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- confidence: double (nullable = true)
| | |-- length: double (nullable = true)
| | |-- role: string (nullable = true)
| | |-- snippetID: string (nullable = true)
| | |-- transcription: string (nullable = true)
| | |-- wordCount: long (nullable = true)
My program does sentiment analysis and returns a dataframe column with predictions (1.0, 0.0, -1.0, etc.) and also some aggregate values such as an average.
Now my problem:
I want to do two things:
I want to add, for example, my average value to the first struct "dialogueData".
I want to add my whole column to the array "snippetlist" as a new struct "sentiment", so that for each snippet in the array the correct sentiment appears.
Is that possible? I really can't find anything useful about this case, so I really hope somebody can help me.
Thanks a lot!
First, whatever approach you use, you would need to do a join so that the elements you want to add end up as new columns of the original dataframe.
In any case, once you have the relevant dataframe you can write it to JSON.
To create the relevant dataframe you have several options:
The first option (which is easiest if you know Scala) is to use Scala, in which case you can use the Dataset API by creating case classes representing the original and target records and converting accordingly (not pretty).
The second option is to convert to RDD and use map to add the relevant data. This would probably be pretty ugly and inefficient.
The third option is to use to_json to convert the entire record to a JSON string. You can then write a UDF which converts that string to a JSON string of the target record (receive the additional input, convert the JSON to a dictionary, update the dictionary and convert back to JSON). The resulting string can then be converted back to dataframe columns by using the from_json function.
The fourth option would be to use dataframe operations. The idea is that you can flatten a struct by using select("structName.*") and then recreate it using struct(col1, col2, ...).
To add a field to the array elements you would first need to posexplode the array (this creates a row for each element, with one column for the position and one for the value), then flatten the element, add the new field (using getItem to pick the value for the relevant position), and convert back using struct and collect_list (see the sketch below).
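A rough Scala sketch of the fourth option (the same functions exist in pyspark.sql.functions); here df stands for the dataframe read from your JSON file, the field names come from your schema, and the lit(...) values are placeholders for the sentiment columns you would really join in:
import org.apache.spark.sql.functions.{col, collect_list, lit, posexplode, struct}

// 1) Add a field to the top-level struct: flatten it, then rebuild it with struct(...).
val withAvg = df.select(
  struct(
    col("dialogueData.dialogueID").as("dialogueID"),
    col("dialogueData.dialogueLength").as("dialogueLength"),
    col("dialogueData.speakerChanges").as("speakerChanges"),
    lit(0.42).as("averageSentiment")            // placeholder for the joined-in average
  ).as("dialogueData"),
  col("snippetlist"))

// 2) Add a field to every array element: posexplode, rebuild each element, collect_list back.
val withSentiment = withAvg
  .select(col("dialogueData"), posexplode(col("snippetlist")).as(Seq("pos", "snippet")))
  .select(
    col("dialogueData"),
    struct(
      col("snippet.confidence").as("confidence"),
      col("snippet.length").as("length"),
      col("snippet.role").as("role"),
      col("snippet.snippetID").as("snippetID"),
      col("snippet.transcription").as("transcription"),
      col("snippet.wordCount").as("wordCount"),
      lit(0.0).as("sentiment")                  // placeholder for the per-snippet prediction
    ).as("snippet"))
  .groupBy("dialogueData")
  .agg(collect_list(col("snippet")).as("snippetlist"))
This needs Spark 2.0+ for collect_list over struct columns, and the order of the re-collected array is not guaranteed unless you also make use of the position column.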

SparkR, split a column of nested JSON strings into columns

I am coming from R, new to SparkR, and trying to split a SparkDataFrame column of JSON strings into respective columns. The columns in the Spark DataFrame are arrays with a schema like this:
> printSchema(tst)
root
|-- FromStation: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ToStation: array (nullable = true)
| |-- element: string (containsNull = true)
If I look at the data in the viewer, View(head(tst$FromStation)) I can see the SparkDataFrame's FromStation column has a form like this in each row:
list("{\"Code\":\"ABCDE\",\"Name\":\"StationA\"}", "{\"Code\":\"WXYZP\",\"Name\":\"StationB\"}", "{...
Where the ... indicates the pattern repeats an unknown number of times.
My Question
How do I extract this information and put it in a flat dataframe? Ideally, I would like to make a FromStationCode and FromStationName column for each observation in the nested array column. I have tried various combinations of explode and getItem...but to no avail. I keep getting a data type mismatch error. I've searched through examples of other people with this challenge in Spark, but SparkR examples are more scarce. I'm hoping someone with more experience using Spark/SparkR could provide some insight.
Many thanks,
nate
I guess you need to convert tst into a usual R object
df = collect(tst)
Then you can operate on df like on any other R data.frame
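For completeness, the split can also be done without collecting, by exploding the array and parsing each JSON string; recent SparkR versions expose equivalents of the explode and from_json functions used here. A Scala sketch of the idea, with tst standing for the same dataframe and the field names taken from the question:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema of one JSON string inside the FromStation array.
val stationSchema = StructType(Seq(
  StructField("Code", StringType),
  StructField("Name", StringType)))

val flat = tst
  .withColumn("FromStation", explode(col("FromStation")))                   // one row per array element
  .withColumn("FromStation", from_json(col("FromStation"), stationSchema))  // string -> struct
  .select(
    col("FromStation.Code").as("FromStationCode"),
    col("FromStation.Name").as("FromStationName"))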

Spark Dataframe schema definition using reflection with case classes and column name aliases

I ran into a little problem with my Spark Scala script. Basically I have raw data on which I am doing aggregations, and after grouping, counting etc. I want to save the output in a specific JSON format.
EDIT:
I tried to simplify the question and rewrote it:
When I select data from the source dataframe with an Array[org.apache.spark.sql.Column] in which the columns have aliases, and then use column names (or indeed indices) stored in variables when mapping the rows to a case class, I get a "Task not serializable" exception.
var dm = sqlContext.createDataFrame(Seq((1,"James"),(2,"Anna"))).toDF("id", "name")
val cl = dm.columns
val cl2 = cl.map(name => col(name).as(name.capitalize))
val dm2 = dm.select(cl2:_*)
val n = "Name"
case class Result(Name:String)
val r = dm2.map(row => Result(row.getAs(n))).toDF
And for the second part of the question: I actually need the final schema to be an array of these Result class objects. I still haven't figured out how to do this either. The expected result should have a schema like this:
case class Test(var FilteredStatistics: Array[Result])
val t = Test(Array(Result("Anna"), Result("James")))
val t2 = sc.parallelize(Seq(t)).toDF
scala> t2.printSchema
root
|-- FilteredStatistics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
TL;DR:
How to map dataframe rows to a case class object when dataframe columns have aliases and variables are used for column names?
How to add these case class objects to an array?
Serialization issue: the problem here is the val n = "Name": it is used inside an anonymous function passed to an RDD transformation (dm2.map(...)), which makes Spark close over that variable and the scope containing it. That scope also includes cl2, which has type Array[Column] and therefore isn't serializable.
The solution is simple - either inline n (to get dm2.map(row => Result(row.getAs("Name")))), or place it in a Serializable context (an object or a class that doesn't contain any non-serializable members).
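A minimal sketch of both fixes (the holder object's name is made up):
// 1) Inline the column name; the closure no longer captures n or the scope holding cl2.
val r1 = dm2.map(row => Result(row.getAs[String]("Name"))).toDF

// 2) Or keep the name in a small serializable holder instead of a local val.
object ColumnNames extends Serializable { val n = "Name" }
val r2 = dm2.map(row => Result(row.getAs[String](ColumnNames.n))).toDF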

How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?

I have as input a set of files formatted as a single JSON object per line. The problem, however, is that one field in these JSON objects is a JSON-escaped String. Example:
{
  "id": 1,
  "name": "some name",
  "problem_field": "{\"height\":180,\"weight\":80}"
}
As expected, when using sqlContext.read.json it will create a DataFrame with the 3 columns id, name and problem_field, where problem_field is a String.
I have no control over the input files and I'd prefer to solve this within Spark, so: is there any way I can get Spark to read that String field as JSON and infer its schema properly?
Note: the JSON above is just a toy example; in my case problem_field would have a varying set of fields, and it would be great if Spark inferred those fields so that I don't have to make any assumptions about which fields exist.
Would this be an acceptable solution?
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
val escapedJsons: RDD[String] = sc.parallelize(Seq("""{"id":1,"name":"some name","problem_field":"{\"height\":180,\"weight\":80}"}"""))
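// Crude text-level unescaping: strip the quotes wrapping the embedded "{...}" object
// and unescape the inner \" characters so the field becomes inline JSON.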
val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))
val dfJsons: DataFrame = sqlContext.read.json(unescapedJsons)
dfJsons.printSchema()
// Output
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- problem_field: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- weight: long (nullable = true)
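An alternative sketch that avoids rewriting the raw text, assuming Spark 2.1+ where from_json is available: read the files normally, infer the inner schema from the escaped strings themselves in a second pass, then parse the column.
import org.apache.spark.sql.functions.{col, from_json}

// problem_field comes in as a plain String column.
val raw = sqlContext.read.json(escapedJsons)

// Infer the schema of the embedded JSON by parsing the strings themselves,
// then turn the column into a struct with from_json.
val innerSchema = sqlContext.read.json(raw.select("problem_field").rdd.map(_.getString(0))).schema
val parsed = raw.withColumn("problem_field", from_json(col("problem_field"), innerSchema))
parsed.printSchema()
This keeps the escaping rules with Spark's JSON parser instead of ad-hoc string replacement, at the cost of an extra pass over the data to infer innerSchema.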