I have a large dataframe whose schema I cannot infer. One of its columns could be read as JSON, but I do not know its full structure (i.e. the keys and values can vary and I do not know in advance what they might be).
I want to read that column as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *
data = [{"ID": 1, "Value": {"a":12, "b": "test"}},
{"ID": 2, "Value": {"a":13, "b": "test2"}}
]
df = spark.createDataFrame(data)
# change the schema so the column is read as a string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)
df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID| Value|
+---+---------------+
| 1| {a=12, b=test}|
| 2|{a=13, b=test2}|
+---+---------------+
As you can see, the values in the Value column have lost their quotes and use = instead of :, so I can no longer work with them properly.
How can I turn that back into a StructType or MapType?
Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing the {} characters from the string column, like this:
from pyspark.sql import functions as F
df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show()
#+---+---------------------+
#|ID |Value |
#+---+---------------------+
#|1 |{a -> 12, b -> test} |
#|2 |{a -> 13, b -> test2}|
#+---+---------------------+
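If you eventually need typed values rather than a map of strings, here is a minimal follow-up sketch (the keys a and b and the int cast are assumptions based on the example data, not part of the original answer):
from pyspark.sql import functions as F

# Hypothetical follow-up: once the keys are known, pull typed columns out of the map.
typed = df.select(
    "ID",
    F.col("Value").getItem("a").cast("int").alias("a"),  # "12" -> 12
    F.col("Value").getItem("b").alias("b")               # stays a string
)
typed.show()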
I am trying to read a JSON document which looks like this
{"id":100, "name":"anna", "hometown":"chicago"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":700, "name":"anna", "hometown":"dudley"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1300, "name":"sarah", "hometown":"hoboken"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1100, "name":"don", "hometown":"santa monica"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1200, "name":"jane", "hometown":"freemont"},{"id":1600, "name":"john", "hometown":"downtown"},{"id":1500, "name":"glenn", "hometown":"uptown"}]
{"id":1400, "name":"steve", "hometown":"newtown"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":600, "name":"john", "hometown":"san jose"},{"id":900, "name":"james", "hometown":"aurora"},{"id":1000, "name":"peter", "hometown":"elgin"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1500, "name":"glenn", "hometown":"uptown"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1500, "name":"glenn", "hometown":"uptown"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":300, "name":"frank", "hometown":"new york"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1100, "name":"don", "hometown":"santa monica"}]
There is a space between the key and the value (the value is a list of JSON objects).
Here is the code I tried:
data = spark \
    .read \
    .format("json") \
    .load("/Users/sahilnagpal/Desktop/dataworld.json")
data.show()
Result I get
+------------+----+-----+
| hometown| id| name|
+------------+----+-----+
| chicago| 100| anna|
|santa monica|1100| don|
| newtown|1400|steve|
| uptown|1500|glenn|
+------------+----+-----+
Result I want
+------------+----+-----+
| hometown| id| name|
+------------+----+-----+
| chicago| 100| anna| -- all the other ID,name,hometown corresponding to this ID and Name
|santa monica|1100| don| -- all the other ID,name,hometown corresponding to this ID and Name
| newtown|1400|steve| -- all the other ID,name,hometown corresponding to this ID and Name
| uptown|1500|glenn| -- all the other ID,name,hometown corresponding to this ID and Name
+------------+----+-----+
I think that instead of reading it as a JSON file you should read it as a text file, because each line is not valid JSON on its own.
Below is the code you can try to get the output you expect:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data1 = spark.read.text("/Users/sahilnagpal/Desktop/dataworld.json")
schema = StructType(
    [
        StructField('id', StringType(), True),
        StructField('name', StringType(), True),
        StructField('hometown', StringType(), True)
    ]
)
data2 = (
    data1
    .withColumn("JsonKey", split(col("value"), "\\[")[0])
    .withColumn("JsonValue", split(col("value"), "\\[")[1])
    .withColumn("data", from_json("JsonKey", schema))
    .select(col('data.*'), 'JsonValue')
)
Based on the above code, the output contains the id, name, and hometown columns parsed from JsonKey, plus the remaining JsonValue string.
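If you also need to query the array part, a minimal sketch (not part of the original answer, reusing the schema defined above) that parses JsonValue back into an array of structs could look like this:
from pyspark.sql.functions import from_json, concat, lit, col
from pyspark.sql.types import ArrayType

# Hypothetical continuation: the leading "[" was consumed by the split,
# so it is prepended again before parsing with from_json.
data3 = data2.withColumn(
    "JsonValue",
    from_json(concat(lit("["), col("JsonValue")), ArrayType(schema))
)
data3.printSchema()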
You can read the input as a CSV file using two spaces as the separator/delimiter. Then parse each column separately using from_json with an appropriate schema.
from pyspark.sql import functions as F

df = spark.read.csv('/Users/sahilnagpal/Desktop/dataworld.json', sep='  ').toDF('json1', 'json2')
df2 = df.withColumn(
    'json1',
    F.from_json('json1', 'struct<id:int, name:string, hometown:string>')
).withColumn(
    'json2',
    F.from_json('json2', 'array<struct<id:int, name:string, hometown:string>>')
).select('json1.*', 'json2')
df2.show(truncate=False)
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |hometown |json2 |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |anna |chicago |[[200, beth, indiana], [400, pete, new jersey], [500, emily, san fransisco], [700, anna, dudley], [1100, don, santa monica], [1300, sarah, hoboken], [1600, john, downtown]]|
|1100|don |santa monica|[[100, anna, chicago], [400, pete, new jersey], [500, emily, san fransisco], [1200, jane, freemont], [1600, john, downtown], [1500, glenn, uptown]] |
|1400|steve|newtown |[[100, anna, chicago], [600, john, san jose], [900, james, aurora], [1000, peter, elgin], [1100, don, santa monica], [1500, glenn, uptown], [1600, john, downtown]] |
|1500|glenn|uptown |[[200, beth, indiana], [300, frank, new york], [400, pete, new jersey], [500, emily, san fransisco], [1100, don, santa monica]] |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
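If the goal is one row per related person, a possible next step (the related_* column names are my own) is to explode json2:
from pyspark.sql import functions as F

# Hypothetical follow-up: one output row per entry of the json2 array.
related = df2.select(
    'id', 'name', 'hometown',
    F.explode('json2').alias('related')
).select(
    'id', 'name', 'hometown',
    F.col('related.id').alias('related_id'),
    F.col('related.name').alias('related_name'),
    F.col('related.hometown').alias('related_hometown')
)
related.show(truncate=False)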
I am trying to flatten a deeply nested JSON and create a Spark dataframe; the ultimate goal is to push that dataframe to Phoenix. I can successfully flatten the JSON with the code below.
def recurs(df: DataFrame): DataFrame = {
  if (df.schema.fields.find(_.dataType match {
    case ArrayType(StructType(_), _) | StructType(_) => true
    case _ => false
  }).isEmpty) df
  else {
    val columns = df.schema.fields.map(f => f.dataType match {
      case _: ArrayType => explode(col(f.name)).as(f.name)
      case s: StructType => col(s"${f.name}.*")
      case _ => col(f.name)
    })
    recurs(df.select(columns: _*))
  }
}
val df = spark.read.json(json_location)
val flatten_df = recurs(df)
flatten_df.show()
My nested json is something like:
{
  "Total Value": 3,
  "Topic": "Example",
  "values": [
    {
      "value": "#example1",
      "points": [
        [
          "123",
          "156"
        ]
      ],
      "properties": {
        "date": "12-04-19",
        "value": "Model example 1"
      }
    },
    {
      "value": "#example2",
      "points": [
        [
          "124",
          "157"
        ]
      ],
      "properties": {
        "date": "12-05-19",
        "value": "Model example 2"
      }
    }
  ]
}
The output I am getting:
+-----------+-----------+----------+-------------+------------------------+------------------------+
|Total Value| Topic     |value     | points      | date                   | value                  |
+-----------+-----------+----------+-------------+------------------------+------------------------+
| 3         | Example   | example1 | [123,156]   | 12-04-19               | Model example 1        |
| 3         | Example   | example2 | [124,157]   | 12-05-19               | Model example 2        |
+-----------+-----------+----------+-------------+------------------------+------------------------+
The value key appears twice in the JSON, so two columns end up with the same name; this is an error, because Phoenix does not allow ingesting data with duplicate column names.
The error message is:
ERROR 514 (42892): A duplicate column name was detected in the object definition or ALTER TABLE/VIEW statement
I am expecting the following output so that Phoenix can differentiate the columns:
+-----------+-----------+--------------+---------------+------------------------+-------------------------+
|Total Value| Topic     |values.value  | values.points | values.properties.date | values.properties.value |
+-----------+-----------+--------------+---------------+------------------------+-------------------------+
| 3         | Example   | example1     | [123,156]     | 12-04-19               | Model example 1         |
| 3         | Example   | example2     | [124,157]     | 12-05-19               | Model example 2         |
+-----------+-----------+--------------+---------------+------------------------+-------------------------+
That way Phoenix can ingest the data properly. Please suggest any changes to the flattening code, or any other way to achieve this. Thanks.
You need slight changes to the recurs method:
Match on ArrayType(st: StructType, _) instead of a bare ArrayType.
Avoid using *, and name every field explicitly in the second match (StructType).
Use backticks in the right places when renaming the fields, so the parent names are kept as prefixes (e.g. values.value).
Here's some code:
def recurs(df: DataFrame): DataFrame = {
  if (!df.schema.fields.exists(_.dataType match {
    case ArrayType(StructType(_), _) | StructType(_) => true
    case _ => false
  })) df
  else {
    val columns = df.schema.fields.flatMap(f => f.dataType match {
      // explode arrays of structs into one row per element, keeping the column name
      case ArrayType(st: StructType, _) => Seq(explode(col(f.name)).as(f.name))
      // name every struct field explicitly and prefix it with the parent name
      case s: StructType =>
        s.fieldNames.map { sf => col(s"`${f.name}`.$sf").as(s"${f.name}.$sf") }
      // leave other columns as-is; backticks protect names that contain dots
      case _ => Seq(col(s"`${f.name}`"))
    })
    recurs(df.select(columns: _*))
  }
}
val newDF = recurs(df).cache
newDF.show(false)
newDF.printSchema
And the new output:
+-------+-----------+-------------+----------------------+-----------------------+------------+
|Topic |Total Value|values.points|values.properties.date|values.properties.value|values.value|
+-------+-----------+-------------+----------------------+-----------------------+------------+
|Example|3 |[[123, 156]] |12-04-19 |Model example 1 |#example1 |
|Example|3 |[[124, 157]] |12-05-19 |Model example 2 |#example2 |
+-------+-----------+-------------+----------------------+-----------------------+------------+
root
|-- Topic: string (nullable = true)
|-- Total Value: long (nullable = true)
|-- values.points: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- values.properties.date: string (nullable = true)
|-- values.properties.value: string (nullable = true)
|-- values.value: string (nullable = true)
I'm a total newbie at Spark and Scala, so it would be great if someone could explain this to me.
Let's take the following JSON:
{
  "id": 1,
  "persons": [
    {
      "name": "n1",
      "lastname": "l1",
      "hobbies": [
        {
          "name": "h1",
          "activity": "a1"
        },
        {
          "name": "h2",
          "activity": "a2"
        }
      ]
    },
    {
      "name": "n2",
      "lastname": "l2",
      "hobbies": [
        {
          "name": "h3",
          "activity": "a3"
        },
        {
          "name": "h4",
          "activity": "a4"
        }
      ]
    }
  ]
}
I'm loading this JSON into an RDD via sc.parallelize(file.json) and into a DF via sqlContext.read.json(file.json). So far so good; this gives me an RDD and a DF (with schema) for the JSON above, but I want to create another RDD/DF from the existing one that contains all distinct "hobbies" records. How can I achieve something like that?
The only thing I get from my operations is a set of WrappedArrays for hobbies, but I cannot go deeper into them or assign them to a DF/RDD.
The SQLContext code I have so far:
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") // I receive the schema for the whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") // subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+--------------------+-----------------+
|                name|         activity|
+--------------------+-----------------+
|                  h1|               a1|
|                  h2|               a2|
|                  h3|               a3|
|                  h4|               a4|
+--------------------+-----------------+
I loaded your example into the hobbies dataframe exactly as you did and worked with it. You can run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
Regardless, when I run on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+
I have a dataframe:
id|name | surname
------------------
1 |James| Smith
2 |Mat | Stone
3 |Stan | Daimon
I need to convert this to an array of JSON objects (just a string):
[
{id:1,name:"James",surname:"Smith"},
{id:2,name:"Mat",surname:"Stone"},
{id:3,name:"Stan",surname:"Daimon"}
]
We can use toJSON from library(jsonlite)
library(jsonlite)
toJSON(df1)
#[{"id":1,"name":"James","surname":"Smith"},{"id":2,"name":"Mat","surname":"Stone"},{"id":3,"name":"Stan","surname":"Daimon"}]
data
df1 <- structure(list(id = 1:3, name = c("James", "Mat", "Stan"),
surname = c("Smith",
"Stone", "Daimon")), .Names = c("id", "name", "surname"),
class = "data.frame", row.names = c(NA, -3L))