Processing a nested dict and list JSON to a DataFrame

I have a JSON object like this:
[{"ID": "101",
"OagCode": "1000",
"house": [{"from": [{"CneeCode":"30100"}], "ShprCode": "20100"}]},
{"ID": "102",
"OagCode": "1001",
"house": [{"from": [{"CneeCode":"30101"}], "ShprCode": "20101"},
{"from": [{"CneeCode":"30102"}], "ShprCode": "20102"},
{"from": [{"CneeCode":"30103"}], "ShprCode": "20103"}]}]
I want to convert this JSON to a DataFrame in such a way that the interior list expands to form a DataFrame with the proper values, as follows:
+-----+---------+----------+----------+
| ID  | OagCode | CneeCode | ShprCode |
+-----+---------+----------+----------+
| 101 | 1000    | 30100    | 20100    |
| 102 | 1001    | 30101    | 20101    |
| 102 | 1001    | 30102    | 20102    |
| 102 | 1001    | 30103    | 20103    |
+-----+---------+----------+----------+
Is there a way to convert the above JSON to a DataFrame without using loops?
I have tried the orient parameter and it doesn't work.

Use json_normalize:
from pandas.io.json import json_normalize

# j is the parsed JSON list from the question
df = json_normalize(j, record_path='house', meta=['ID', 'OagCode'])
print(df)
  CneeCode ShprCode   ID OagCode
0    30100    20100  101    1000
1    30101    20101  102    1001
2    30102    20102  102    1001
3    30103    20103  102    1001
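Note: in pandas 1.0+, json_normalize is exposed as a top-level pandas function, and expanding the nested "from" list takes an extra level in record_path. A minimal sketch under those assumptions (the trailing rename is purely cosmetic):

import json
import pandas as pd

# raw is assumed to hold the JSON string from the question
j = json.loads(raw)

df = pd.json_normalize(
    j,
    record_path=['house', 'from'],
    meta=['ID', 'OagCode', ['house', 'ShprCode']],
).rename(columns={'house.ShprCode': 'ShprCode'})
print(df)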

Print key, value pairs from nested MySQL JSON

I have extracted a MySQL JSON dictionary structure and I wish to get all the values associated with the keys alpha and beta; I also wish to print the associated key. The structure of the dictionary is:
results =
{1:
    {"a": {"alpha": 1234,
           "beta": 2345},
     "b": {"alpha": 1234,
           "beta": 2345},
     "c": {"alpha": 1234,
           "beta": 2345},
    },
 2:
    {"ab": {"alpha": 1234,
            "beta": 2345},
     "ac": {"alpha": 1234,
            "beta": 2345},
     "bc": {"alpha": 1234,
            "beta": 2345},
    },
 3:
    {"abc": {"alpha": 1234,
             "beta": 2345}
    }
 "random_key": "not_interested_in_this_value"
}
So far I have had some success extracting the data I want using:
SELECT JSON_EXTRACT alpha, beta FROM results;
This gave me the alpha and beta columns; however, I would ideally like to associate each value with its key to get:
+-------+---------+---------+
| key   | alpha   | beta    |
+-------+---------+---------+
| a     | 1234.   | 2345.   |
| b     | 1234.   | 2345.   |
| c     | 1234.   | 2345.   |
| ab    | 1234.   | 2345.   |
| ac    | 1234.   | 2345.   |
| bc    | 1234.   | 2345.   |
| abc   | 1234.   | 2345.   |
+-------+---------+---------+
I am very new to MySQL and any help is appreciated.
First of all, what you posted is not valid JSON. You can use integers as values, but you can't use integers as keys in objects. Also you have a few spurious , symbols. I had to fix these mistakes before I could insert the data into a table to test.
I was able to solve this using MySQL 8.0's JSON_TABLE() function in the following way:
select
    j2.`key`,
    json_extract(results, concat('$."', j1.`key`, '"."', j2.`key`, '".alpha')) as alpha,
    json_extract(results, concat('$."', j1.`key`, '"."', j2.`key`, '".beta'))  as beta
from mytable
cross join json_table(json_keys(results), '$[*]' columns (`key` int path '$')) as j1
cross join json_table(json_keys(json_extract(results, concat('$."', j1.`key`, '"'))), '$[*]' columns (`key` varchar(3) path '$')) as j2
where j2.`key` IS NOT NULL;
Output:
+------+-------+------+
| key | alpha | beta |
+------+-------+------+
| a | 1234 | 2345 |
| b | 1234 | 2345 |
| c | 1234 | 2345 |
| ab | 1234 | 2345 |
| ac | 1234 | 2345 |
| bc | 1234 | 2345 |
| abc | 1234 | 2345 |
+------+-------+------+
If you find this sort of query too difficult, I would encourage you to reconsider whether you want to store data in JSON.
If I were you, I'd store data in normal rows and columns, then the query would be a lot simpler and easier to write and maintain.
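To illustrate that suggestion, here is a minimal Python sketch (my own addition, not part of the answer above) that flattens the structure into (key, alpha, beta) rows ready for a plain INSERT. It assumes raw holds the JSON text after it has been repaired into valid JSON (string keys, commas fixed):

import json

results = json.loads(raw)  # raw: the repaired JSON text

rows = []
for group in results.values():
    if not isinstance(group, dict):
        continue  # skips scalar entries such as "random_key"
    for key, vals in group.items():
        rows.append((key, vals["alpha"], vals["beta"]))

# rows == [('a', 1234, 2345), ('b', 1234, 2345), ...] and can be
# bulk-inserted into a plain (key, alpha, beta) table.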

PySpark - getting values from an array that has a range of min and max values

I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have a DataFrame called df with three columns: 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The 'weightingRange' column is an array containing the following:
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
So the DataFrame looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting):
+-----------+-------------+------------------------+
| companyId | companySize | weightingRange         |
+-----------+-------------+------------------------+
| ABC1      | 150         | [{"maximum":100, etc}] |
| ABC2      | 50          | [{"maximum":100, etc}] |
+-----------+-------------+------------------------+
So for an entry with companySize = 150, I need to return the weight 456 in a column called 'companyWeighting'.
It should show the following:
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange         | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1      | 150         | [{"maximum":100, etc}] | 456              |
| ABC2      | 50          | [{"maximum":100, etc}] | 123              |
+-----------+-------------+------------------------+------------------+
I've had a look at
df.withColumn("tmp",explode(col("weightingRange"))).select("tmp.*")
and then joining, but trying to apply that would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this.
First, create a sample DataFrame:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0,   "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50,  [{"min": 0,   "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then, create a UDF and apply it to each row to derive the new column:
def get_weight(wt, wt_rnge):
    # return the weight of the first range that contains wt
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']

get_weight_udf = F.udf(lambda x, y: get_weight(x, y))
df = df.withColumn('companyWeighting',
                   get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output:
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
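If you would rather avoid a Python UDF, Spark 2.4+ offers SQL higher-order functions. A sketch of the same lookup (my alternative, not part of the answer above), assuming the map-typed weightingRange column created in the sample:

# Keep only the range that contains companySize, then take that
# element's weight. Bracket access works for the map-typed elements
# built above (and for struct-typed elements as well).
df = df.withColumn(
    'companyWeighting',
    F.expr("filter(weightingRange, "
           "r -> companySize >= r['min'] and companySize <= r['max'])"
           "[0]['weight']"))
df.show()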

MySQL - Search JSON data column

I have a MySQL database column that contains JSON-array-encoded strings. I would like to search the JSON array where the "Elapsed" value is greater than a particular number and return the corresponding TaskID value of the object in which the value was found. I have been attempting combinations of the JSON_SEARCH, JSON_CONTAINS, and JSON_EXTRACT functions, but I am not getting the desired results.
[
    {
        "TaskID": "TAS00000012344",
        "Elapsed": "25"
    },
    {
        "TaskID": "TAS00000012345",
        "Elapsed": "30"
    },
    {
        "TaskID": "TAS00000012346",
        "Elapsed": "35"
    },
    {
        "TaskID": "TAS00000012347",
        "Elapsed": "40"
    }
]
Referencing the JSON above, if I search for "Elapsed" > "30", then 2 records would be returned:
'TAS00000012346'
'TAS00000012347'
I am using MySQL version 5.7.11 and am new to querying JSON data. Any help would be appreciated. Thanks.
With MySQL pre-8.0, there is no easy way to turn a JSON array into a recordset (i.e., the JSON_TABLE() function is not yet available).
So, one way or another, we need to manually iterate through the array to extract the relevant pieces of data (using JSON_EXTRACT()). Here is a solution that uses an inline query to generate a list of numbers; another classic approach is to use a numbers table.
Assuming a table called mytable with a column called js holding the JSON content:
SELECT
    JSON_EXTRACT(js, CONCAT('$[', n.idx, '].TaskID'))  AS TaskID,
    JSON_EXTRACT(js, CONCAT('$[', n.idx, '].Elapsed')) AS Elapsed
FROM mytable t
CROSS JOIN (
    SELECT 0 AS idx
    UNION ALL SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
) n
WHERE JSON_EXTRACT(js, CONCAT('$[', n.idx, '].Elapsed')) * 1.0 > 30
NB: in the WHERE clause, the * 1.0 operation is there to force the conversion to a number.
Demo on DB Fiddle with your sample data:
| TaskID         | Elapsed |
| -------------- | ------- |
| TAS00000012346 | 35      |
| TAS00000012347 | 40      |
Yes, you can definitely do it using the JSON_EXTRACT() function in MySQL.
Let's take a table that contains JSON (table client_services here):
+-----+-----------+--------------------------------------+
| id  | client_id | service_values                       |
+-----+-----------+--------------------------------------+
| 100 | 1000      | { "quota": 1,"data_transfer":160000} |
| 101 | 1000      | { "quota": 2,"data_transfer":800000} |
| 102 | 1000      | { "quota": 3,"data_transfer":70000}  |
| 103 | 1001      | { "quota": 1,"data_transfer":97000}  |
| 104 | 1001      | { "quota": 2,"data_transfer":1760}   |
| 105 | 1002      | { "quota": 2,"data_transfer":1060}   |
+-----+-----------+--------------------------------------+
Now let's say we want the client_id of everyone who has quota > 1; then use this query:
SELECT
    id, client_id,
    JSON_EXTRACT(service_values, '$.quota') AS quota
FROM client_services
WHERE JSON_EXTRACT(service_values, '$.quota') > 1;
And hence it will result in:
+-----+-----------+-------+
| id  | client_id | quota |
+-----+-----------+-------+
| 101 | 1000      | 2     |
| 102 | 1000      | 3     |
| 104 | 1001      | 2     |
| 105 | 1002      | 2     |
+-----+-----------+-------+
Hope this helps!

Replace null values only in certain columns of a PySpark DataFrame [duplicate]

I have the following sample DataFrame:
   a |    b |    c |
   1 |    2 |    4 |
   0 | null | null |
null |    3 |    4 |
And I want to replace the null values only in the first 2 columns, "a" and "b":
   a |    b |    c |
   1 |    2 |    4 |
   0 |    0 | null |
   0 |    3 |    4 |
Here is the code to create the sample DataFrame:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df2 = df2.fillna(0, subset=['a', 'b'])
There is a parameter named subset for choosing the columns, unless your Spark version is lower than 1.3.1.
Use a dictionary to fill values of certain columns:
df2 = df2.fillna({'a': 0, 'b': 0})
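Putting it together on the sample frame, a quick sketch using the modern SparkSession entry point (both forms leave the nulls in column "c" untouched):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)], ['a', 'b', 'c'])

# Fill only columns "a" and "b"; the null in "c" survives.
df2.fillna(0, subset=['a', 'b']).show()
df2.fillna({'a': 0, 'b': 0}).show()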

Convert JSON to separate columns in Hive

I have 4 columns in a Hive database table. The first two columns are of type string; the 3rd and 4th hold JSON. How do I extract the JSON data into different columns?
The SerDes available in Hive seem to handle only pure JSON data. I have both normal (STRING) and JSON data. How can I extract the data into separate columns here?
Example:
abc 2341 {max:2500e0,value:"20",Type:"1",ProviderType:"ABC"} {Name:"ABC",minA:1200e0,StartDate:1483900200000,EndDate:1483986600000,Flags:["flag4","flag3","flag2","flag1"]}
xyz 6789 {max:1300e0,value:"10",Type:"0",ProviderType:"foo"} {Name:"foo",minA:3.14159e0,StartDate:1225864800000,EndDate:1225864800000,Flags:["foo","foo"]}
Given a fixed (valid) JSON:
create table mytable (str string,i int,jsn1 string, jsn2 string);
insert into mytable values
('abc',2341,'{"max":2500e0,"value":"20","Type":"1","ProviderType":"ABC"}','{"Name":"ABC","minA":1200e0,"StartDate":1483900200000,"EndDate":1483986600000,"Flags":["flag4","flag3","flag2","flag1"]}')
,('xyz',6789,'{"max":1300e0,"value":"10","Type":"0","ProviderType":"foo"}','{"Name":"foo","minA":3.14159e0,"StartDate":1225864800000,"EndDate":1225864800000,"Flags":["foo","foo"]}')
;
select str,i
,jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
,jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate
,jsn2_Flags
from mytable
lateral view json_tuple (jsn1,'max','value','Type','ProviderType')
j1 as jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
lateral view json_tuple (jsn2,'Name','minA','StartDate','EndDate','Flags')
j2 as jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate,jsn2_Flags
;
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| str | i | jsn1_max | jsn1_value | jsn1_type | jsn1_providertype | jsn2_name | jsn2_mina | jsn2_startdate | jsn2_enddate | jsn2_flags |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| abc | 2341 | 2500.0 | 20 | 1 | ABC | ABC | 1200.0 | 1483900200000 | 1483986600000 | ["flag4","flag3","flag2","flag1"] |
| xyz | 6789 | 1300.0 | 10 | 0 | foo | foo | 3.14159 | 1225864800000 | 1225864800000 | ["foo","foo"] |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+