I have a PySpark dataframe that looks like this:
|id|json |
+--+--------------------------------------+
|1 |{"attr1": "value1"} |
|2 |{"attr2": "value2", "attr3": "value3"}|
root
|-- id: string (nullable = true)
|-- json: string (nullable = true)
How do I convert it into a new dataframe which will look like this:
|id|attr |value |
+--+-----+------+
|1 |attr1|value1|
|2 |attr2|value2|
|2 |attr3|value3|
(tried to google for the solution with no success, apologies if it's a duplicate)
Thanks!
Please check the schema; it looks to me like a map type. If the json column is of MapType, use map_entries to extract the entries and explode them:
df = spark.createDataFrame(Data, schema)
df.withColumn('attri', explode(map_entries('json'))).select('id', 'attri.*').show()
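Note that the printSchema output above shows json as a string, not a map. If it really is a string, a minimal sketch, assuming flat string key/value pairs, is to parse it with from_json into a map first and then explode the entries:
from pyspark.sql.functions import explode, from_json
from pyspark.sql.types import MapType, StringType

# Parse the JSON string into a map<string,string>, then explode it into
# one (attr, value) row per entry.
parsed = df.withColumn("json", from_json("json", MapType(StringType(), StringType())))
parsed.select("id", explode("json").alias("attr", "value")).show(truncate=False)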
It seems like there should be a function for this in Spark SQL similar to pivoting, but I haven't found any solution for transforming a JSON key into a value. Suppose I have a badly formed JSON (the format of which I cannot change):
{"A long string containing serverA": {"x": 1, "y": 2}}
how can I process it to
{"server": "A", "x": 1, "y": 2}
?
I read the JSONs into an sql.dataframe and would then like to process them as described above:
val cs = spark.read.json("sample.json")
.???
If you want to use only Spark functions and no UDFs, you can use from_json to parse the JSON into a map (a schema needs to be specified). Then you just need to extract the information with Spark functions.
One way to do it is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = MapType(
  StringType,
  StructType(Array(
    StructField("x", IntegerType),
    StructField("y", IntegerType)
  ))
)

spark.read.text("...")
  .withColumn("json", from_json('value, schema))
  .withColumn("key", map_keys('json).getItem(0))
  .withColumn("value", map_values('json).getItem(0))
  .withColumn("server",
    // Extracting the server name with a regex
    regexp_replace(regexp_extract('key, "server[^ ]*", 0), "server", ""))
  .select("server", "value.*")
  .show(false)
which yields:
+------+---+---+
|server|x |y |
+------+---+---+
|A |1 |2 |
+------+---+---+
I'm completely new to Spark, but I don't mind whether the answer is in Python or Scala. I can't show the actual data for privacy reasons, but basically I am reading JSON files with a structure like this:
{
"EnqueuedTimeUtc": 'some date time',
"Properties": {},
"SystemProperties": {
"connectionDeviceId": "an id",
"some other fields that we don't care about": "data"
},
"Body": {
"device_id": "an id",
"tabs": [
{
"selected": false,
"title": "some title",
"url": "https:...."
},
{"same again, for multiple tabs"}
]
}
}
Most of the data is of no interest. What I want is a DataFrame consisting of the time, device_id, and url. There can be multiple URLs for the same device and time, so I'm looking to explode these into one row per URL.
| timestamp | device_id | url |
My immediate problem is that when I read this, although it can work out the structure of SystemProperties, Body is just a string, probably because of variation. Perhaps I need to specify the schema; would that help?
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionAuthMethod: string (nullable = true)
| |-- connectionDeviceGenerationId: string (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
| |-- contentEncoding: string (nullable = true)
| |-- contentType: string (nullable = true)
| |-- enqueuedTime: string (nullable = true)
Any idea of an efficient way (there are lots and lots of these records) to extract the URLs and associate them with the time and device_id? Thanks in advance.
Here's an example of the extraction. Basically, you can use from_json to convert Body into something more structured, and use explode(transform()) to get the URLs and expand them into separate rows.
# Sample dataframe
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|Body |EnqueuedTimeUtc|SystemProperties|
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
|{"device_id":"an id","tabs":[{"selected":false,"title":"some title","url":"https:1"},{"selected":false,"title":"some title","url":"https:2"}]}|some date time |[an id] |
+----------------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------+
df.printSchema()
root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
# Extract desired properties
df2 = df.selectExpr(
"EnqueuedTimeUtc as timestamp",
"from_json(Body, 'device_id string, tabs array<map<string,string>>') as Body"
).selectExpr(
"timestamp",
"Body.device_id",
"explode(transform(Body.tabs, x -> x.url)) as url"
)
df2.show()
+--------------+---------+-------+
| timestamp|device_id| url|
+--------------+---------+-------+
|some date time| an id|https:1|
|some date time| an id|https:2|
+--------------+---------+-------+
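If you'd rather give Body a typed schema instead of the DDL string above (for example to keep selected as a boolean), here is a minimal sketch, assuming the tab fields are exactly selected/title/url as in the question's sample:
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, BooleanType, StringType,
                               StructField, StructType)

# Schema for the Body JSON string (field names assumed from the question).
body_schema = StructType([
    StructField("device_id", StringType()),
    StructField("tabs", ArrayType(StructType([
        StructField("selected", BooleanType()),
        StructField("title", StringType()),
        StructField("url", StringType()),
    ]))),
])

df2 = (
    df.withColumn("Body", F.from_json("Body", body_schema))
      .select(
          F.col("EnqueuedTimeUtc").alias("timestamp"),
          F.col("Body.device_id").alias("device_id"),
          # Body.tabs.url is the array of url fields taken from each tab struct.
          F.explode("Body.tabs.url").alias("url"),
      )
)
This variant avoids transform (which needs Spark 2.4+) by projecting the url field out of the array of structs and exploding it directly.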
The JSON file is in the following format:
**Input-**
{'key-a' : [{'key1':'value1', 'key2':'value2'},{'key1':'value3', 'key2':'value4'}...],
'key-b':'value-b',
'key-c':'value-c'},
{'key-a' : [{'key1':'value5', 'key2':'value6'},{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
I need to combine the data to merge all the values of 'key-a' and return a single json object as output:
**Output-**
{'key-a' :
[{'key1':'value1', 'key2':'value2'},
{'key1':'value3', 'key2':'value4'},
{'key1':'value5', 'key2':'value6'},
{'key1':'value7', 'key2':'value8'}...],
'key-b':'value-b',
'key-c':'value-c'}
The data is loaded into a PySpark dataframe with the following schema:
**Schema:**
key-a
|-- key1: string (nullable= false)
|-- key2: string (nullable= true)
key-b: string (nullable= true)
key-c: string (nullable= false)
I have tried using the groupByKey function, but when I try to show() the output I get the following error: "'GroupedData' object has no attribute 'show'".
How can I achieve the above transformation?
PFA: the error received when trying the answer below.
This can be a working solution for you -
# Create the dataframe here
from pyspark.sql import functions as F
from pyspark.sql import types as T

df_new = spark.createDataFrame(
    [
        str({"key-a": [{"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"}], "key-b": "value-b"}),
        str({"key-a": [{"key1": "value5", "key2": "value6"}, {"key1": "value7", "key2": "value8"}], "key-b": "value-b"}),
    ],
    T.StringType(),
)
df_new.show(truncate=False)
+-----------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------+
|{'key-a': [{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}], 'key-b': 'value-b'}|
|{'key-a': [{'key1': 'value5', 'key2': 'value6'}, {'key1': 'value7', 'key2': 'value8'}], 'key-b': 'value-b'}|
+-----------------------------------------------------------------------------------------------------------+
Use from_json with the correct schema to parse the column first.
The idea here is to get the keys of the JSON into a column and then use groupBy.
df = df_new.withColumn('col', F.from_json("value",T.MapType(T.StringType(), T.StringType())))
df = df.select("col", F.explode("col").alias("x", "y"))
df.select("x", "y").show(truncate=False)
+-----+---------------------------------------------------------------------+
|x |y |
+-----+---------------------------------------------------------------------+
|key-a|[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]|
|key-b|value-b |
|key-a|[{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]|
|key-b|value-b |
+-----+---------------------------------------------------------------------+
Logic here:
We create a dummy column for the sake of grouping everything into a single row.
df_grp = df.groupBy("x").agg(F.collect_set("y").alias("y"))
df_grp = df_grp.withColumn("y", F.col("y").cast(T.StringType()))
df_grp = df_grp.withColumn("array", F.array("x", "y"))
df_grp = df_grp.withColumn("dummy_col", F.lit("1"))
df_grp = df_grp.groupBy("dummy_col").agg(F.collect_set("array"))
df_grp.show(truncate=False)
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dummy_col|collect_set(array) |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[key-a, [[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}], [{"key1":"value5","key2":"value6"},{"key1":"value7","key2":"value8"}]]], [key-b, [value-b]]]|
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have tried using the groupByKey function, but when I try to show() the output I get the following error: "'GroupedData' object has no attribute 'show'".
This was giving trouble because you did not use any aggregate function in the groupBy clause.
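For reference, here is a more compact variant of the same idea (my own sketch, not taken from the question): parse the JSON with a typed schema and merge the key-a arrays in one aggregation. It assumes Spark 2.4+ for flatten and that key-b is identical on every row; key-c, absent from the sample data above, could be added the same way.
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Typed schema for the JSON strings in df_new (key-c omitted because the
# sample data above doesn't contain it; add it the same way if present).
schema = T.StructType([
    T.StructField("key-a", T.ArrayType(T.MapType(T.StringType(), T.StringType()))),
    T.StructField("key-b", T.StringType()),
])

parsed = df_new.withColumn("parsed", F.from_json("value", schema))

merged = parsed.agg(
    # Collect every row's key-a array and flatten them into one array.
    F.flatten(F.collect_list(F.col("parsed").getField("key-a"))).alias("key-a"),
    # key-b is assumed identical across rows, so any value will do.
    F.first(F.col("parsed").getField("key-b")).alias("key-b"),
)

merged.select(F.to_json(F.struct("key-a", "key-b")).alias("json")).show(truncate=False)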
I am using the MongoDB Scala connector for Spark. The documentation
https://docs.mongodb.com/spark-connector/master/scala/aggregation/
shows how to apply a filter on a given JSON document. What I am not able to figure out is: if we have a multi-level JSON and want to filter on it, how do we access that key/value in the JSON document?
Json Document:
{ "_id" : 1, "test" : 1 }
{ "_id" : 2, "test" : 2 }
{ "_id" : 3, "test" : 3 }
{ "_id" : 4, "test" : 4 }
{ "_id" : 5, "test" : 5 }
{ "_id" : 6, "test" : 6 }
Filter Document:
val rdd = MongoSpark.load(sc)
val filteredRdd = rdd.filter(doc => doc.getInteger("test") > 5)
println(filteredRdd.count)
println(filteredRdd.first.toJson)
Multilevel Json Document
{
"_id": 1,
"test": 1,
"additionalProperties": {
"value": "35",
"phone": "566623232"
}
}
Problem Statement:
I want to filter on the basis of the "value" attribute, but I don't know how to access it. I tried the following, but it is not working:
val filteredRdd = rdd.filter(doc => doc.getInteger("value") > 5)
val filteredRdd = rdd.filter(doc => doc.getInteger("additionalProperties.value") > 5)
Can anybody guide me on how to access the "value" attribute? What would be the right syntax?
Some Other Options that I have tried:
According to the official documentation of the MongoDB Scala connector for Spark, I tried filtering the document with the aggregation pipeline. The following line of code works fine:
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { id: { $eq : '134' } } }")))
But if I want to access the "value" item using the same syntax, it doesn't work:
val filterWithPipeline = customRdd.withPipeline(Seq(Document.parse("{ $match: { value: { $eq : '134' } } }")))
So how can I use the same approach to query the multi-level JSON?
Here are a couple of ways you can read from MongoDB and filter it.
Creating SparkSession
import org.apache.spark.sql.SparkSession
import com.mongodb.spark._
import org.bson.Document

val spark = SparkSession.builder().master("local").appName("Test")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/db.collectionName")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/db.CollectionName")
  .getOrCreate()

import spark.implicits._
import com.mongodb.spark.sql._
Read as MongoRDD[Document] and filter it
MongoSpark.load(spark.sparkContext).filter(doc => {
val value = doc.get("additionalProperties").asInstanceOf[Document].get("value")
value.toString.toInt > 5
})
Read as Dataframe with spark.read.mongo()
val filterDF = spark.read.mongo().filter($"additionalProperties.value".lt(5))
Output:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|2.0|[5, 566623232] |2.0 |
+---+--------------------+----+
Hope this helps!
What if you use the dataframe?
val df = spark.read.json("path")
Here is my example,
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
|3 |[566623232, 1] |3 |
+---+--------------------+----+
and the schema is
root
|-- _id: long (nullable = true)
|-- additionalProperties: struct (nullable = true)
| |-- phone: string (nullable = true)
| |-- value: string (nullable = true)
|-- test: long (nullable = true)
Then,
df.filter(col("additionalProperties").getItem("value").cast("int") > 5)
will give the result such as:
+---+--------------------+----+
|_id|additionalProperties|test|
+---+--------------------+----+
|1 |[566623232, 35] |1 |
|2 |[566623232, 35] |2 |
+---+--------------------+----+
I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark.
My code works perfectly for level 1 (key:value) but fails to get independent columns for each (key:value) pair that is part of the nested JSON.
JSON schema sample
Note: this is not the exact schema. It's just to give an idea of the nested nature of the schema.
{
  "tweet": {
    "text": "RT #author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
      "text": "original message",
      "user": {
        "screen_name": "OriginalTweeter"
      },
      "place": {},
      "entities": {},
      "extended_entities": {}
    }
  },
  "entities": {},
  "extended_entities": {}
}
PySpark Code
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("text", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("retweeted_status", StructType([
        StructField("text", StringType(), True),
        StructField("created_at", StringType(), True)
    ]))
])
df = spark.read.schema(schema).json("/user/sagarp/NaMo/data/NaMo2019-02-12_00H.json")
df.show()
Current output (with real JSON data)
All (key:value) pairs under the nested retweeted_status JSON are squashed into one single list, e.g. [text, created_at, entities].
+--------------------+--------------------+--------------------+
| text| created_at| retweeted_status|
+--------------------+--------------------+--------------------+
|RT #Hoosier602: #...|Mon Feb 11 19:04:...|[#CLeroyjnr #Gabr...|
|RT #EgSophie: Oh ...|Mon Feb 11 19:04:...|[Oh cool so do yo...|
|RT #JacobAWohl: #...|Mon Feb 11 19:04:...|[#realDonaldTrump...|
Expected output
I want independent columns for each key. Also, note that you already have a parent-level key with the same name, text. How will you deal with such instances?
Ideally, I would want columns like "text", "entities", "retweet_status_text", "retweet_status_entities", etc
Your schema is not mapped properly ... please see these posts if you want to manually construct the schema (which is recommended if the data doesn't change):
PySpark: How to Update Nested Columns?
https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
Also, if your JSON is multi-line (like your example) then you can:
read the JSON via the multiline option to get Spark to infer the schema
then save the nested schema
then read the data back in with the correct schema mapping, which avoids triggering an extra Spark job for schema inference
! cat nested.json
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
getSchema = spark.read.option("multiline", "true").json("nested.json")
extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
loadJson = spark.read.option("multiline", "true").schema(extractSchema).json("nested.json")
loadJson.printSchema()
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array |dict |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1] |1 |string1|
|[2, 4, 6]|[, value2] |2 |string2|
|[3, 6, 9]|[extra_value3, value3]|3 |string3|
+---------+----------------------+---+-------+
Once you have the data loaded with the correct mapping, you can start to transform it into a normalized schema via "dot" notation for nested columns and explode to flatten arrays, etc.
loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()
+------+------------+
| key| extra_key|
+------+------------+
|value1| null|
|value2| null|
|value3|extra_value3|
+------+------------+
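Applied to the tweet example in the question, the same "dot" notation lets you pull the nested fields up into prefixed top-level columns. A rough sketch (the column names are assumed from the question's sample schema; adjust them to your real data):
from pyspark.sql import functions as F

# df is the dataframe read with the explicit schema earlier in the question.
flat = df.select(
    F.col("text"),
    F.col("created_at"),
    F.col("retweeted_status.text").alias("retweet_status_text"),
    F.col("retweeted_status.created_at").alias("retweet_status_created_at"),
)
flat.show(truncate=False)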