How to read a Nested JSON in Spark Scala?

Here is my Nested JSON file.
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}
How can I read this JSON file into a DataFrame with Spark Scala? There is no array in the JSON file, so I can't use explode. Can anyone help?

import org.apache.spark.sql.functions._

val df = spark.read.option("multiline", true).json("data/test.json")
df
  .select(col("dc_id"), explode(array("source.*")) as "level1")
  .withColumn("id", col("level1.id"))
  .withColumn("ip", col("level1.ip"))
  .withColumn("temp", col("level1.temp"))
  .withColumn("description", col("level1.description"))
  .withColumn("c02_level", col("level1.c02_level"))
  .withColumn("lat", col("level1.geo.lat"))
  .withColumn("long", col("level1.geo.long"))
  .drop("level1")
  .show(false)
Sample Output:
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc_id |id |ip |temp|description |c02_level|lat |long |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc-101|10 |68.28.91.22 |35 |Sensor attached to the container ceilings |1475 |38.0 |97.0 |
|dc-101|8 |208.109.163.218|40 |Sensor attached to the factory ceilings |1346 |33.61|-111.89|
|dc-101|13 |67.185.72.1 |34 |Sensor ipad attached to carbon cylinders |1370 |47.41|-122.0 |
|dc-101|5 |204.116.105.67 |40 |Sensor embedded in exhaust pipes in the ceilings|1574 |35.93|-85.46 |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
Instead of selecting each column by hand, you can write something generic that pulls out all of the individual columns; a minimal sketch follows the note below.
Note: Tested with Spark 2.3
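For example, a minimal sketch that derives the column list from the exploded struct's own schema (rather than an actual UDF); note that geo stays a nested struct unless you flatten one more level:
import org.apache.spark.sql.functions._

// Explode as before, then derive the column list from the struct's schema
// instead of hard-coding each field.
val exploded = df.select(col("dc_id"), explode(array("source.*")) as "level1")
val level1Fields = exploded.select("level1.*").columns
val flattened = exploded.select(
  col("dc_id") +: level1Fields.map(f => col(s"level1.$f").alias(f)): _*
)
flattened.show(false)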

Take the JSON string into a variable called jsonString:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.json(Seq(jsonString).toDS)
val df1 = df.withColumn("lat", explode(array("source.sensor-igauge.geo.lat")))
You can follow the same approach for other structures as well (map/array structures); explode works on those directly, as in the sketch below.
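A minimal sketch with a hypothetical map column (made-up data, just to illustrate): explode can be applied to map and array columns directly, without the array(...) wrapping needed for structs.
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical map column: sensor name -> temperature
val mapDf = Seq(("dc-101", Map("sensor-igauge" -> 35, "sensor-ipad" -> 34)))
  .toDF("dc_id", "temps")

// explode on a map yields one row per entry, with key/value columns
mapDf.select($"dc_id", explode($"temps") as Seq("sensor", "temp")).show(false)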

import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json("myfile.json")
df.select($"dc_id", explode(array("source.*")))
  .select($"dc_id", $"col.c02_level", $"col.description", $"col.geo.lat", $"col.geo.long", $"col.id", $"col.ip", $"col.temp")
  .show(false)
Output:
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc_id |c02_level|description |lat |long |id |ip |temp|
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc-101|1475 |Sensor attached to the container ceilings |38.0 |97.0 |10 |68.28.91.22 |35 |
|dc-101|1346 |Sensor attached to the factory ceilings |33.61|-111.89|8 |208.109.163.218|40 |
|dc-101|1370 |Sensor ipad attached to carbon cylinders |47.41|-122.0 |13 |67.185.72.1 |34 |
|dc-101|1574 |Sensor embedded in exhaust pipes in the ceilings|35.93|-85.46 |5 |204.116.105.67 |40 |
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
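Note that explode(array("source.*")) drops the sensor names (the field names of the source struct). If you also need them as a column, here is a hedged sketch along the lines of the PySpark answer further down, reading the field names from the schema:
import org.apache.spark.sql.functions._

// Build one struct per sensor carrying its name alongside selected values, then explode.
val sensorNames = df.select("source.*").columns
df.select(
  col("dc_id"),
  explode(array(sensorNames.map(s =>
    struct(
      lit(s).alias("source_name"),
      col(s"source.`$s`.id").alias("id"),
      col(s"source.`$s`.description").alias("description")
    )
  ): _*)) as "sensor"
).select("dc_id", "sensor.*").show(false)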

Related

Split array of structs from JSON into Dataframe rows in SPARK

I am reading from Kafka through Spark Structured Streaming. The input Kafka messages are in the JSON format below:
[
  {
    "customer": "Jim",
    "sex": "male",
    "country": "US"
  },
  {
    "customer": "Pam",
    "sex": "female",
    "country": "US"
  }
]
I have defined the schema like below to parse it:
import org.apache.spark.sql.types._

val schemaAsJson = ArrayType(StructType(Seq(
  StructField("customer", StringType, true),
  StructField("sex", StringType, true),
  StructField("country", StringType, true))), true)
My code looks like this,
df.select(from_json($"col", schemaAsJson) as "json")
.select("json.customer","json.sex","json.country")
The current output looks like this,
+--------------+----------------+----------------+
| customer| sex|country |
+--------------+----------------+----------------+
| [Jim, Pam]| [male, female]| [US, US]|
+--------------+----------------+----------------+
Expected output:
+--------------+----------------+----------------+
| customer| sex| country|
+--------------+----------------+----------------+
| Jim| male| US|
| Pam| female| US|
+--------------+----------------+----------------+
How do I split array of structs into individual rows as above? Can someone please help?
You need to explode the column before selecting:
df.select(explode_outer(from_json($"value", schemaAsJson)) as "json")
.select("json.customer","json.sex","json.country").show()

Get category of movie from json struct using spark scala

I have a DataFrame df_movies with a genres column that looks like JSON:
|genres |
[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 37, 'name': 'Western'}]
How can I extract the value of the first 'name' field?
way #1
df_movies.withColumn("genres_extract",
  regexp_extract(col("genres"), """ 'name': (\w+)""", 1)).show(false)
way #2
df_movies.withColumn("genres_extract",
  regexp_extract(col("genres"), """[{'id':\s\d,\s 'name':\s(\w+)""", 1))
Expected: Action
You can use the get_json_object function:
Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
.toDF("genres")
.withColumn("genres_extract", get_json_object(col("genres"), "$[0].name" ))
.show()
+--------------------+--------------+
| genres|genres_extract|
+--------------------+--------------+
|[{"id": 28, "name...| Action|
+--------------------+--------------+
Another possibility is using the from_json function together with a self-defined schema. This lets you "unwrap" the JSON structure into a DataFrame with all of the data in it, so you can use it however you want.
Something like the following:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
  .toDF("genres")

// Creating the necessary schema for the from_json function
val moviesSchema = ArrayType(
  new StructType()
    .add("id", StringType)
    .add("name", StringType)
)

// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedMovies column into separate columns
val parsedDf = df
  .withColumn("parsedMovies", explode(from_json(col("genres"), moviesSchema)))
  .select("parsedMovies.*")
parsedDf.show(false)
+---+---------+
| id| name|
+---+---------+
| 28| Action|
| 12|Adventure|
| 37| Western|
+---+---------+
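If you only need the first genre's name as a single column (no extra rows), here is a hedged variant of the same from_json parse, assuming Spark 2.4+ for element_at:
import org.apache.spark.sql.functions._

// Parse once, then take the first array element's "name" field without exploding.
df.withColumn(
  "genres_extract",
  element_at(from_json(col("genres"), moviesSchema), 1).getField("name")
).show(false)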

How to explode structs with pyspark explode()

How do I convert the following JSON into the relational rows that follow it? The part that I am stuck on is the fact that the pyspark explode() function throws an exception due to a type mismatch. I have not found a way to coerce the data into a suitable format so that I can create rows out of each object within the source key within the sample_json object.
JSON INPUT
sample_json = """
{
"dc_id": "dc-101",
"source": {
"sensor-igauge": {
"id": 10,
"ip": "68.28.91.22",
"description": "Sensor attached to the container ceilings",
"temp":35,
"c02_level": 1475,
"geo": {"lat":38.00, "long":97.00}
},
"sensor-ipad": {
"id": 13,
"ip": "67.185.72.1",
"description": "Sensor ipad attached to carbon cylinders",
"temp": 34,
"c02_level": 1370,
"geo": {"lat":47.41, "long":-122.00}
},
"sensor-inest": {
"id": 8,
"ip": "208.109.163.218",
"description": "Sensor attached to the factory ceilings",
"temp": 40,
"c02_level": 1346,
"geo": {"lat":33.61, "long":-111.89}
},
"sensor-istick": {
"id": 5,
"ip": "204.116.105.67",
"description": "Sensor embedded in exhaust pipes in the ceilings",
"temp": 40,
"c02_level": 1574,
"geo": {"lat":35.93, "long":-85.46}
}
}
}"""
DESIRED OUTPUT
dc_id    source_name     id   description
-----------------------------------------------------------------------------
dc-101   sensor-igauge   10   Sensor attached to the container ceilings
dc-101   sensor-ipad     13   Sensor ipad attached to carbon cylinders
dc-101   sensor-inest    8    Sensor attached to the factory ceilings
dc-101   sensor-istick   5    Sensor embedded in exhaust pipes in the ceilings
PYSPARK CODE
from pyspark.sql.functions import *
df_sample_data = spark.read.json(sc.parallelize([sample_json]))
df_expanded = df_sample_data.withColumn("one_source",explode_outer(col("source")))
display(df_expanded)
ERROR
AnalysisException: cannot resolve 'explode(source)' due to data type
mismatch: input to function explode should be array or map type, not
struct....
I put together this Databricks notebook to further demonstrate the challenge and clearly show the error. I will be able to use this notebook to test any recommendations provided herein.
You can't use explode on structs, but you can get the column names in the struct source (with df.select("source.*").columns), use a list comprehension to build an array of the fields you want from each nested struct, and then explode to get the desired result:
from pyspark.sql import functions as F

df1 = df.select(
    "dc_id",
    F.explode(
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.description").alias("description")
            )
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")

df1.show(truncate=False)
#+------+-------------+---+------------------------------------------------+
#|dc_id |source_name |id |description |
#+------+-------------+---+------------------------------------------------+
#|dc-101|sensor-igauge|10 |Sensor attached to the container ceilings |
#|dc-101|sensor-inest |8 |Sensor attached to the factory ceilings |
#|dc-101|sensor-ipad |13 |Sensor ipad attached to carbon cylinders |
#|dc-101|sensor-istick|5 |Sensor embedded in exhaust pipes in the ceilings|
#+------+-------------+---+------------------------------------------------+

Extract data from JSON and insert it as new by using jq

I have a database in a JSON file; I have already sorted it and removed some data from the objects using jq.
But I'm stuck on adding new fields to each object.
Here is a part of my JSON file:
{
  "Name": "Forrest.Gump.1994.MULTi.1080p.AMZN.WEB-DL.DDP5.1.H264-Ao",
  "ID": "SMwIkBoC2blXeWnBa9Hjge9YPs90"
},
{
  "Name": "Point.Blank.2019.MULTi.1080p.NF.WEB-DL.DDP5.1.x264-Ao",
  "ID": "OZI4mOuBXuJ7b89FLgXJoozyhHe9"
},
{
  "Name": "The.Incredible.Hulk.2008.MULTi.2160p.UHD.BluRay.REMUX.HDR.HEVC.DTS-HD.MA.7.1",
  "ID": "jZzR4_B_vjm593cYKR7j97XAMv6d"
},
Is it possible, using jq and for example a regexp, to extract some data and insert it as a new field in each object? I wish to achieve something like this:
{
  "Name": "Forrest.Gump.1994.MULTi.1080p.AMZN.WEB-DL.DDP5.1.H264-Ao",
  "ID": "SMwIkBoC2blXeWnBa9Hjge9YPs90",
  "Year": "1994",
  "Res": "1080p"
},
{
  "Name": "Point.Blank.2019.MULTi.1080p.NF.WEB-DL.DDP5.1.x264-Ao",
  "ID": "OZI4mOuBXuJ7b89FLgXJoozyhHe9",
  "Year": "2019",
  "Res": "1080p"
},
{
  "Name": "The.Incredible.Hulk.2008.MULTi.2160p.UHD.BluRay.REMUX.HDR.HEVC.DTS-HD.MA.7.1",
  "ID": "jZzR4_B_vjm593cYKR7j97XAMv6d",
  "Year": "2008",
  "Res": "2160p"
},
Thanks in advance
Here's one solution that assumes for simplicity that the fragment you've shown comes from an array:
map( . as $in
| .Name | capture(".*[.](?<year>[12][0-9]{3})[.](?<rest>.*)")
| .year as $year
| (.rest | split(".") | .[1]) as $res
| $in + {Year: $year, Res: $res} )
Hopefully, once you're familiar with some jq basics, such as map, capture, and the EXP as $var syntax, the above will be more-or-less self-explanatory.
As a one-liner
Here's the same thing but as a one-liner:
map(. + (.Name | capture(".*[.](?<Year>[12][0-9]{3})[.](?<Res>.*)") | {Year, Res: (.Res | split(".")[1])}))

json2sstable error during conversion from json to sstable

Here I have a JSON input which I want to import into Cassandra, so I am using json2sstable as below:
./json2sstable -K yelp -c business /home/srinath/Desktop/test.json /home/srinath/Desktop/CD/Cassandra/cassandra/data/yelp/business/Standard1-e-1-Data.db
Output:
ERROR 15:03:02,594 Unable to initialize MemoryMeter (jamm not specified as javaagent). This means Cassandra will be unable to measure object sizes accurately and may consequently OOM.
org.codehaus.jackson.map.JsonMappingException: Can not deserialize instance of java.lang.Object[] out of START_OBJECT token
at [Source: /home/srinath/Desktop/test.json; line: 1, column: 1]
at org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163)
at org.codehaus.jackson.map.deser.StdDeserializationContext.mappingException(StdDeserializationContext.java:219)
at org.codehaus.jackson.map.deser.StdDeserializationContext.mappingException(StdDeserializationContext.java:212)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.handleNonArray(ObjectArrayDeserializer.java:177)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:88)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:18)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2695)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1294)
at org.codehaus.jackson.JsonParser.readValueAs(JsonParser.java:1368)
at org.apache.cassandra.tools.SSTableImport.importUnsorted(SSTableImport.java:351)
at org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:335)
at org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:559)
ERROR: Can not deserialize instance of java.lang.Object[] out of START_OBJECT token
at [Source: /home/srinath/Desktop/test.json; line: 1, column: 1]
================================================================================================================================================
Sample Json:
{
  "business_id": "qarobAbxGSHI7ygf1f7a_Q",
  "full_address": "891 E Baseline Rd\nSuite 102\nGilbert, AZ 85233",
  "open": true,
  "categories": [
    "Sandwiches",
    "Restaurants"
  ],
  "city": "Gilbert",
  "review_count": 10,
  "name": "Jersey Mike's Subs",
  "neighborhoods": [],
  "longitude": -111.8120071,
  "state": "AZ",
  "stars": 3.5,
  "latitude": 33.3788385,
  "type": "business"
}
For example, given a table like this:
 cid | key  | ts
-----+------+-----
 101 | ramu | 999
the corresponding JSON would be:
[{
  "columns": [["cid", 101], ["key", "ramu"], ["ts", 999]]
}]
The above JSON format is based on the above table. In the same way, you can prepare your JSON based on your table's format and columns.