Split array of structs from JSON into Dataframe rows in SPARK - json

I am reading from Kafka through Spark Structured Streaming. The incoming Kafka message has the following JSON format:
[
  {
    "customer": "Jim",
    "sex": "male",
    "country": "US"
  },
  {
    "customer": "Pam",
    "sex": "female",
    "country": "US"
  }
]
I have defined the schema as below to parse it:
val schemaAsJson = ArrayType(StructType(Seq(
  StructField("customer", StringType, true),
  StructField("sex", StringType, true),
  StructField("country", StringType, true))), true)
My code looks like this,
df.select(from_json($"col", schemaAsJson) as "json")
  .select("json.customer", "json.sex", "json.country")
The current output looks like this,
+--------------+----------------+----------------+
| customer| sex|country |
+--------------+----------------+----------------+
| [Jim, Pam]| [male, female]| [US, US]|
+--------------+----------------+----------------+
Expected output:
+--------------+----------------+----------------+
| customer| sex| country|
+--------------+----------------+----------------+
| Jim| male| US|
| Pam| female| US|
+--------------+----------------+----------------+
How do I split array of structs into individual rows as above? Can someone please help?

You need to explode the column before selecting; explode_outer produces one row per element of the parsed array (and, unlike explode, keeps rows where the array is null):
df.select(explode_outer(from_json($"value", schemaAsJson)) as "json")
  .select("json.customer", "json.sex", "json.country").show()

Related

Pyspark creating a Struct object from a json string of arrays and objects without schema

I have a dataframe with a json string column.
I am trying to turn this json string column into a proper STRUCT object, but as you can see my schema is dynamic and can differ per row. In some instances I have a single JSON object, and in others a JSON array of objects, where the number of objects in the array cannot be known in advance.
I tried this solution, but it can only generate schemas for a single object, not for an array of objects.
json_schema = spark.read.json(df.rdd.map(lambda row: row['json-string'])).schema
df = df.withColumn('new-struct-column', F.from_json(F.col('json-string'), json_schema))
Also, I have an extra key called text being generated by this method and I don't know where it is coming from.
If the JSON does not contain any nested JSON, this should help you:
>>> df.withColumn("correct-json-string", concat(lit("["),regexp_extract(col("json-string"), "\{.*\}", 0), lit("]"))).show(5, False)
+---+------------------------------------------------------+------------------------------------------------------+
|id |json-string |correct-json-string |
+---+------------------------------------------------------+------------------------------------------------------+
|1 |[{"code": 1, "label": "1"}] |[{"code": 1, "label": "1"}] |
|2 |{"code": 2, "label":"2"} |[{"code": 2, "label":"2"}] |
|3 |[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|
+---+------------------------------------------------------+------------------------------------------------------+
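Once every row is normalized to a JSON array, you can infer a single schema for the whole column and parse it. A sketch building on the snippet above (assuming F is pyspark.sql.functions and spark is the active session):

from pyspark.sql import functions as F

fixed = df.withColumn(
    "correct-json-string",
    F.concat(F.lit("["),
             F.regexp_extract(F.col("json-string"), r"\{.*\}", 0),
             F.lit("]")))

# Infer one schema covering all rows, then parse the strings into structs
json_schema = spark.read.json(fixed.rdd.map(lambda row: row["correct-json-string"])).schema
fixed = fixed.withColumn("new-struct-column",
                         F.from_json(F.col("correct-json-string"), json_schema))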

jq get all unique values for a given key in a list of objects

Let's say I have an endpoint that returns the following array:
[
{"name": "Joe", "age": 21},
{"name": "Steve", "age": 27},
{"name": "Michelle", "age": 32},
{"name": "Joe", "age": 23},
]
I know I can get all names using the following command (using httpie):
http https://some-endpoint | jq '.[] | .name'
# output
Joe
Steve
Michelle
Joe
How can I get all unique names (so there are no duplicates)?
Assuming the input is valid JSON, the following jq program will yield an array of the distinct names:
map(.name) | unique
If the input has superfluous commas as in the sample shown, you might wish to consider using a preprocessor, such as any-json or hjson.
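Note that unique also sorts its input, so here the result is ["Joe","Michelle","Steve"]; append | .[] to the program to emit the names one per line, as in the original command.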

Json to CSV issues

I am using pandas to normalize some JSON data. I get stuck when more than one field is either an object or an array.
If I use record_path on Car, it breaks on the second record.
Any pointers on how to get something like this to produce one line in the CSV per Car and per Location?
[
  {
    "Name": "John Doe",
    "Car": [
      "Car1",
      "Car2"
    ],
    "Location": "Texas"
  },
  {
    "Name": "Jane Roe",
    "Car": "Car1",
    "Location": [
      "Illinois",
      "Kansas"
    ]
  }
]
Here is the output
Name,Car,Location
John Doe,"['Car1', 'Car2']",Texas
Jane Roe,Car1,"['Illinois', 'Kansas']"
Here is the code:
import json
import pandas as pd

with open('file.json') as data_file:
    data = json.load(data_file)
df = pd.io.json.json_normalize(data, errors='ignore')
Would like it to end up like this:
Name,Car,Location
John Doe,Car1,Texas
John Doe,Car2,Texas
Jane Roe,Car1,Illinois
Jane Roe,Car1,Kansas
The answers worked great until I ran into a JSON file with extra data. This is what a file looks like with the extra values:
{
  "Customers": [
    {
      "Name": "John Doe",
      "Car": [
        "Car1",
        "Car2"
      ],
      "Location": "Texas",
      "Repairs": {
        "RepairLocations": {
          "RepairsCompleted": [
            "Fix1",
            "Fix2"
          ]
        }
      }
    },
    {
      "Name": "Jane Roe",
      "Car": "Car1",
      "Location": [
        "Illinois",
        "Kansas"
      ]
    }
  ]
}
Here is what I am going for. I think this is the most readable format, but anything that at least shows all the keys would do:
Name,Car,Location,Repairs:RepairLocation
John Doe,Car1,Texas,RepairsCompleted:Fix1
John Doe,Car1,Texas,RepairsCompleted:Fix2
John Doe,Car2,Texas,RepairsCompleted:Fix1
John Doe,Car2,Texas,RepairsCompleted:Fix2
Jane Roe,Car1,Illinois,
Jane Roe,Car1,Kansas,
Any suggestions on getting this second part?
A simple jq solution which is also a bit more generic than needed here:
["Name", "Car", "Location"],
(.[]
| [.Name] + (.Car|..|scalars|[.]) + (.Location|..|scalars|[.]))
| #csv
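This works because (.Car|..|scalars|[.]) is a generator: for an array it emits one single-element array per item, and for a scalar just one. The concatenation therefore yields one CSV row per combination of Car and Location values.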
You're looking for something like this:
def expand($keys):
  . as $in
  | reduce $keys[] as $k ( [{}];
      map(. + {
        ($k): ($in[$k] | if type == "array" then .[] else . end)
      })
    ) | .[];

(.[0] | keys_unsorted) as $h
| $h, (.[] | expand($h) | [.[$h[]]]) | @csv
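Here expand takes the list of keys from the first object and, for each key whose value is an array, multiplies the accumulated rows out via .[], producing the Cartesian product of all array-valued fields; scalar fields are copied through unchanged.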
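Since the question itself is about pandas, here is a minimal pandas-only sketch for the original two-column case (assuming pandas 1.0+; DataFrame.explode passes scalar cells through unchanged, so the mixed list/scalar columns are fine):

import json
import pandas as pd

with open('file.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data)
# explode() expands list cells into rows and leaves scalars as-is
df = df.explode('Car').explode('Location')
df.to_csv('out.csv', index=False)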

PySpark - Getting list of dicts and converting its keys/values to columns

I have the following json (located in my local file system in path_json):
[
  {
    "name": "John",
    "email": "john#hisemail.com",
    "gender": "Male",
    "dict_of_columns": [
      {
        "column_name": "hobbie",
        "columns_value": "guitar"
      },
      {
        "column_name": "book",
        "columns_value": "1984"
      }
    ]
  },
  {
    "name": "Mary",
    "email": "mary#heremail.com",
    "gender": "Female",
    "dict_of_columns": [
      {
        "column_name": "language",
        "columns_value": "Python"
      },
      {
        "column_name": "job",
        "columns_value": "analyst"
      }
    ]
  }
]
As you can see, this is a nested json.
I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
OK. Now it produces the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns |email |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]] |john#hisemail.com |Male |John|
|[[language, Python], [job, analyst]]|mary#heremail.com |Female|Mary|
+------------------------------------+-------------------+------+----+
I want to know if there is a way to produce the following dataframe:
+----+-----------------+------+------+-------+--------+----+
|book|email |gender|hobbie|job |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john#hisemail.com|Male |guitar|null |null |John|
|null|mary#heremail.com|Female|null |analyst|Python |Mary|
+----+-----------------+------+------+-------+--------+----+
A few comments:
My real data has many thousands of rows
I don't know all the column_name values in my dataset (there are many of them)
email is unique per row, so it can be used as a key if a join is necessary. I tried this approach before: create a main dataframe with columns [name, gender, email] plus a separate dataframe per row containing the dictionaries, but without success (and it didn't perform well).
Thank you so much!
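One way to get there, sketched here as a suggestion rather than a tested answer from the thread, is to explode the array and pivot on column_name (assuming the DataFrame df read above):

from pyspark.sql import functions as F

kv = (df
    .select("name", "email", "gender", F.explode("dict_of_columns").alias("kv"))
    .select("name", "email", "gender",
            F.col("kv.column_name").alias("k"),
            F.col("kv.columns_value").alias("v")))

# One row per person, one column per distinct column_name
result = kv.groupBy("name", "email", "gender").pivot("k").agg(F.first("v"))

Since pivot first has to collect the distinct column_name values, it can help performance to pass them explicitly as pivot's second argument when they are known in advance.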

parsing Json data with jq

I need help parsing JSON data using jq. I used to parse the data with the JSONPath expression [?(@.type=='router')].externalIP; I am not sure how to do the same using jq.
The query should return the externalIP of the entries whose type is router:
198.22.66.99
JSON data snippet below:
[
  {
    "externalHostName": "localhost",
    "externalIP": "198.22.66.99",
    "internalHostName": "localhost",
    "isUp": true,
    "pod": "gateway",
    "reachable": true,
    "region": "dc-1",
    "type": [
      "router"
    ],
    "uUID": "b5f986fe-982e-47ae-8260-8a3662f25fc2"
  },
]
cat your-data.json | jq '.[]|.externalIP|select(type=="string")'
"198.22.66.99"
"192.22.66.29"
"192.22.66.89"
"192.66.22.79"
explanation:
.[] | .externalIP | select(type=="string")
for every array entry | get field 'externalIP' | drop nulls
EDIT/ADDENDUM: to filter on type (this expects "router" to be at index 0 of the type array):
cat x | jq '.[]|select(.type[0] == "router")|.externalIP'
"198.22.66.99"
"192.22.66.89"
The description:
I would like to extract externalIP only for the entries where "type": [ "router" ]
The corresponding jq query is:
.[] | select(.type==["router"]) | .externalIP
To base the query on whether "router" is amongst the specified types:
.[] | select(.type|index("router")) | .externalIP
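Note that index("router") returns the element's position or null when absent; since only false and null are falsy in jq, even an index of 0 passes the select, and entries without "router" in type are dropped.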