How to explode structs with pyspark explode() - json

How do I convert the following JSON into the relational rows that follow it? The part that I am stuck on is the fact that the pyspark explode() function throws an exception due to a type mismatch. I have not found a way to coerce the data into a suitable format so that I can create rows out of each object within the source key within the sample_json object.
JSON INPUT
sample_json = """
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}"""
DESIRED OUTPUT
dc_id    source_name    id  description
-------------------------------------------------------------------------------
dc-101   sensor-igauge  10  Sensor attached to the container ceilings
dc-101   sensor-ipad    13  Sensor ipad attached to carbon cylinders
dc-101   sensor-inest    8  Sensor attached to the factory ceilings
dc-101   sensor-istick   5  Sensor embedded in exhaust pipes in the ceilings
PYSPARK CODE
from pyspark.sql.functions import *
df_sample_data = spark.read.json(sc.parallelize([sample_json]))
df_expanded = df_sample_data.withColumn("one_source",explode_outer(col("source")))
display(df_expanded)
ERROR
AnalysisException: cannot resolve 'explode(source)' due to data type
mismatch: input to function explode should be array or map type, not
struct....
I put together this Databricks notebook to further demonstrate the challenge and clearly show the error. I will be able to use this notebook to test any recommendations provided herein.

You can't use explode on a struct, but you can get the column names inside the source struct (with df.select("source.*").columns), use a list comprehension to build an array of structs holding the fields you want from each nested struct, and then explode that array to get the desired result:
from pyspark.sql import functions as F

df1 = df.select(
    "dc_id",
    F.explode(
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.description").alias("description")
            )
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")

df1.show(truncate=False)
#+------+-------------+---+------------------------------------------------+
#|dc_id |source_name |id |description |
#+------+-------------+---+------------------------------------------------+
#|dc-101|sensor-igauge|10 |Sensor attached to the container ceilings |
#|dc-101|sensor-inest |8 |Sensor attached to the factory ceilings |
#|dc-101|sensor-ipad |13 |Sensor ipad attached to carbon cylinders |
#|dc-101|sensor-istick|5 |Sensor embedded in exhaust pipes in the ceilings|
#+------+-------------+---+------------------------------------------------+
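If you would rather keep using explode directly, another option (not part of the answer above, just a sketch) is to recast the source struct as a map with to_json/from_json, since explode accepts map columns. This assumes every sensor struct shares the same schema, which holds for this sample; df_sample_data is the DataFrame from the question:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# Schema of one sensor struct; all sensors share it in this sample.
sensor_schema = df_sample_data.schema["source"].dataType.fields[0].dataType

# Round-trip the struct through JSON so Spark reads it back as a map,
# then explode the map into (source_name, sensor) pairs.
df_map = df_sample_data.withColumn(
    "source_map",
    F.from_json(F.to_json("source"), MapType(StringType(), sensor_schema))
)

df2 = (
    df_map
    .select("dc_id", F.explode("source_map").alias("source_name", "sensor"))
    .select("dc_id", "source_name", "sensor.id", "sensor.description")
)
df2.show(truncate=False)

Either way you end up with one row per sensor; the array-of-structs approach above avoids the JSON round trip, which is usually the cheaper of the two.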

Related

How to parse nested json and write in Redshift?

I have the following JSON structure:
{
  "firstname": "A",
  "lastname": "B",
  "age": 24,
  "address": {
    "streetAddress": "123",
    "city": "San Jose",
    "state": "CA",
    "postalCode": "394221"
  },
  "phonenumbers": [
    { "type": "home", "number": "123456789" },
    { "type": "mobile", "number": "987654321" }
  ]
}
I need to copy this JSON from S3 to a Redshift table.
I am currently using the COPY command with a JSONPaths file, but it loads the array as a single column.
I want the nested array to be parsed so that the table looks like this:
firstname | lastname | age | streetaddress | city     | state | postalcode | type   | number
---------------------------------------------------------------------------------------------
A         | B        | 24  | 123           | San Jose | CA    | 394221     | home   | 123456789
A         | B        | 24  | 123           | San Jose | CA    | 394221     | mobile | 987654321
Is there a way to do that?
You can use nested JSON paths by making use of a JSONPaths file. However, this does not work with the multiple phone number types.
If you can modify the dataset to have multiple records (one for mobile, one for home), then your JSONPaths file would look similar to the one below.
{
  "jsonpaths": [
    "$.firstname",
    "$.lastname",
    "$.age",
    "$.address.streetAddress",
    "$.address.city",
    "$.address.state",
    "$.address.postalCode",
    "$.phonenumbers[0].type",
    "$.phonenumbers[0].number"
  ]
}
If you are unable to change the format, you will need an ETL step before the data can be consumed by Redshift. For this you could use an S3 object-creation event to trigger a Lambda function that performs the transformation before the data is loaded into Redshift, as sketched below.
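The transformation itself is small. As a rough sketch (file names and wiring are hypothetical, not part of the original answer), it only needs to copy the shared fields and emit one newline-delimited record per phone number, after which the JSONPaths entries for the phone fields become simply $.type and $.number:

import json

# Hypothetical pre-load step (e.g. the body of a Lambda handler):
# emit one flattened record per phone number so COPY sees one row each.
def flatten(record):
    base = {k: v for k, v in record.items() if k != "phonenumbers"}
    for phone in record.get("phonenumbers", []):
        row = dict(base)
        row["type"] = phone.get("type")
        row["number"] = phone.get("number")
        yield row

with open("input.json") as src, open("flattened.json", "w") as dst:
    for row in flatten(json.load(src)):
        dst.write(json.dumps(row) + "\n")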

reshape jq nested file and make csv

I've been struggling with this JSON for the whole day; I want to turn it into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the json to contain just 2 objects inside "items".
What I would like to get is a csv like this
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are two extra complications: there are two types of "officers", some are people and some are companies, so not all keys present for people are present for companies, and vice versa. I'd like those missing entries to be null. The second complication is nested objects like "name", which contains a comma, or "address", which contains several sub-objects (which I guess I could flatten in pandas, though).
{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}
What I've been doing is creating new JSON objects, extracting the fields I need, like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours - and I am sure there is a quicker way.
Right now I am trying this new approach - creating a json whose root is the company number and as argument a list of its officers.
{(.links.self | split("/")[2]): .items[]}
Using jq, it's easier to first extract the values from the top-level object that will be shared, then generate the desired rows. You'll want to limit the number of times you go through the items to at most once:
$ jq -r '(.links.self | split("/")[2]) as $companyCode
    | .items[]
    | [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
    | @csv
' input.json
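With the sample above, that filter should produce something close to the following (string values get quoted by @csv, and the missing country_of_residence for the corporate officer comes out empty):

"OC418979","England","llp-designated-member","2017-09-15"
"OC418979",,"corporate-llp-designated-member","2017-09-15"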
OK, you want to scan the list of officers, extract some fields from each one if they are present, and write that out in CSV format.
The first part is extracting the data from the JSON. Assuming you have loaded it into a Python object called data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
      data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')))
The get method on a dictionary lets you supply a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
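Since the question also mentions flattening the nested address object in pandas, here is a minimal sketch of that route as an alternative (file names are illustrative; json_normalize builds column names such as address_locality from the nested keys):

import json
import pandas as pd

with open("input.json") as f:
    data = json.load(f)

# Flatten each officer; nested keys like address.locality become address_locality,
# and keys missing from an officer simply come out as NaN / empty cells.
df = pd.json_normalize(data["items"], sep="_")
df.insert(0, "company_number", data["links"]["self"].split("/")[2])

cols = ["company_number", "country_of_residence", "officer_role", "appointed_on"]
df[cols].to_csv("officers.csv", index=False)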

How to read a Nested JSON in Spark Scala?

Here is my Nested JSON file.
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}
How can I read this JSON file into a DataFrame with Spark Scala? There is no array object in the JSON file, so I can't use explode. Can anyone help?
import org.apache.spark.sql.functions._

val df = spark.read.option("multiline", true).json("data/test.json")

df.select(col("dc_id"), explode(array("source.*")) as "level1")
  .withColumn("id", col("level1.id"))
  .withColumn("ip", col("level1.ip"))
  .withColumn("temp", col("level1.temp"))
  .withColumn("description", col("level1.description"))
  .withColumn("c02_level", col("level1.c02_level"))
  .withColumn("lat", col("level1.geo.lat"))
  .withColumn("long", col("level1.geo.long"))
  .drop("level1")
  .show(false)
Sample Output:
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc_id |id |ip |temp|description |c02_level|lat |long |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc-101|10 |68.28.91.22 |35 |Sensor attached to the container ceilings |1475 |38.0 |97.0 |
|dc-101|8 |208.109.163.218|40 |Sensor attached to the factory ceilings |1346 |33.61|-111.89|
|dc-101|13 |67.185.72.1 |34 |Sensor ipad attached to carbon cylinders |1370 |47.41|-122.0 |
|dc-101|5 |204.116.105.67 |40 |Sensor embedded in exhaust pipes in the ceilings|1574 |35.93|-85.46 |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
Instead of selecting each column individually, you can try writing a generic UDF to get all the individual columns.
Note: tested with Spark 2.3.
Assuming the JSON string has been read into a variable called jsonString:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.json(Seq(jsonString).toDS)
val df1 = df.withColumn("lat", explode(array("source.sensor-igauge.geo.lat")))

You can follow the same steps for other structures as well (map and array structures).
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json("myfile.json")

df.select($"dc_id", explode(array("source.*")))
  .select($"dc_id", $"col.c02_level", $"col.description", $"col.geo.lat", $"col.geo.long", $"col.id", $"col.ip", $"col.temp")
  .show(false)
Output:
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc_id |c02_level|description |lat |long |id |ip |temp|
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc-101|1475 |Sensor attached to the container ceilings |38.0 |97.0 |10 |68.28.91.22 |35 |
|dc-101|1346 |Sensor attached to the factory ceilings |33.61|-111.89|8 |208.109.163.218|40 |
|dc-101|1370 |Sensor ipad attached to carbon cylinders |47.41|-122.0 |13 |67.185.72.1 |34 |
|dc-101|1574 |Sensor embedded in exhaust pipes in the ceilings|35.93|-85.46 |5 |204.116.105.67 |40 |
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+

CSV to JSON and add title

I have a csv document:
{
  "epsilon_id": 194029423,
  "weather": "cloudy",
  "temperature": 27
},
{
  "epsilon_id": 932856192,
  "weather": "sunny",
  "temperature": 31
}
I was wondering if there is a tool to turn it into valid JSON where the field epsilon_id becomes the title (key) for each record.
ex:
{
  194029423: {
    "weather": "cloudy",
    "temperature": 27
  },
  932856192: {
    "weather": "sunny",
    "temperature": 31
  }
}
I would prefer it to be a program (in whatever language) that I can run because I have 1,000 entries in my test sample and I will have tens of thousands in my final copy.
Any help would be much appreciated!
You are looking at a JSON transformation, which can of course be achieved with custom programming. I can explain how to achieve this in Java, but functionally it is going to be the same in any programming language of your choice.
Your input JSON will look like this:
[{
  "epsilon_id": 194029423,
  "weather": "cloudy",
  "temperature": 27
},
{
  "epsilon_id": 932856192,
  "weather": "sunny",
  "temperature": 31
}]
When you parse this in Java using the popular Jackson library, you will get a list of objects of the class below:
class Input
{
    @JsonProperty(access = Access.WRITE_ONLY)
    Integer epsilon_id;
    String weather;
    int temperature;
}
Then you create a Map<Integer, Input> and populate it like below:
Map<Integer, Input> map = new HashMap<>();
for (Input obj : listOfInputs) {
    map.put(obj.epsilon_id, obj);
}
Serialize your result map using Jackson again to get your desired output format:
{
  194029423: {
    "weather": "cloudy",
    "temperature": 27
  },
  932856192: {
    "weather": "sunny",
    "temperature": 31
  }
}
If you are not very familiar with Java & Jackson JSON parsing, I found this tutorial with a code sample, which will give you a head start.
import csv, json

# rename this file or pass it in as sys.argv[1],
# then pipe the output into another file, or:
with open("./foo.csv") as f:
    output = {}
    for line in csv.DictReader(f):
        key = line.pop("epsilon_id")
        if key in output:
            print("Duplicate Id -> {} ".format(key))
        output[key] = line

# then pipe this output into another file,
print(json.dumps(output, indent=2))

# or write a file
with open("/tmp/foo.json", 'w') as f:
    json.dump(output, f)
Python makes it pretty easy: this reads every row from the CSV file, keys the output by its epsilon_id, and flags duplicate ids.
demo : https://repl.it/#markboyle/UsableIcyFeed

json2sstable error during conversion from json to sstable

Here I have a JSON input which I want to import into Cassandra, so I am using json2sstable as below:
./json2sstable -K yelp -c business /home/srinath/Desktop/test.json /home/srinath/Desktop/CD/Cassandra/cassandra/data/yelp/business/Standard1-e-1-Data.db
Output:
ERROR 15:03:02,594 Unable to initialize MemoryMeter (jamm not specified as javaagent). This means Cassandra will be unable to measure object sizes accurately and may consequently OOM.
org.codehaus.jackson.map.JsonMappingException: Can not deserialize instance of java.lang.Object[] out of START_OBJECT token
at [Source: /home/srinath/Desktop/test.json; line: 1, column: 1]
at org.codehaus.jackson.map.JsonMappingException.from(JsonMappingException.java:163)
at org.codehaus.jackson.map.deser.StdDeserializationContext.mappingException(StdDeserializationContext.java:219)
at org.codehaus.jackson.map.deser.StdDeserializationContext.mappingException(StdDeserializationContext.java:212)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.handleNonArray(ObjectArrayDeserializer.java:177)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:88)
at org.codehaus.jackson.map.deser.std.ObjectArrayDeserializer.deserialize(ObjectArrayDeserializer.java:18)
at org.codehaus.jackson.map.ObjectMapper._readValue(ObjectMapper.java:2695)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1294)
at org.codehaus.jackson.JsonParser.readValueAs(JsonParser.java:1368)
at org.apache.cassandra.tools.SSTableImport.importUnsorted(SSTableImport.java:351)
at org.apache.cassandra.tools.SSTableImport.importJson(SSTableImport.java:335)
at org.apache.cassandra.tools.SSTableImport.main(SSTableImport.java:559)
ERROR: Can not deserialize instance of java.lang.Object[] out of START_OBJECT token
at [Source: /home/srinath/Desktop/test.json; line: 1, column: 1]
================================================================================================================================================
Sample Json:
{
  "business_id": "qarobAbxGSHI7ygf1f7a_Q",
  "full_address": "891 E Baseline Rd\nSuite 102\nGilbert, AZ 85233",
  "open": true,
  "categories": [
    "Sandwiches",
    "Restaurants"
  ],
  "city": "Gilbert",
  "review_count": 10,
  "name": "Jersey Mike's Subs",
  "neighborhoods": [],
  "longitude": -111.8120071,
  "state": "AZ",
  "stars": 3.5,
  "latitude": 33.3788385,
  "type": "business"
}
If your table looks like this:

 cid | key  | ts
-----+------+-----
 101 | ramu | 999

then the corresponding JSON for json2sstable looks like this:

[{
  "columns": [["cid", 101], ["key", "ramu"], ["ts", 999]]
}]

The JSON format above is based on the table above. In the same way, prepare your JSON according to your table's format and columns.
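As a purely illustrative sketch (the row key choice, timestamps, and column set must match your actual yelp.business table, and list values such as categories would still need to be serialized to fit your column types), the sample business record could be reshaped into that column-list layout like this:

import json

# Reshape the sample record into the [{"columns": [[name, value], ...]}] layout
# shown above; check sstable2json output from an existing table for the exact
# envelope (row key encoding, timestamps) your Cassandra version expects.
with open("test.json") as f:
    business = json.load(f)

row = {
    "key": business["business_id"],
    "columns": [[name, value] for name, value in business.items()
                if name != "business_id"]
}

with open("business_import.json", "w") as out:
    json.dump([row], out, indent=2)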