I would like to create a JSON document out of two DataFrames (one is the parent and the other is the child). The child records should become an array, forming a nested JSON.
Df1 (Department):
+----------+------------+
| dept_Id | dept_name |
+----------+------------+
| 10 | Sales |
+----------+------------+
Df2 (Employee):
+----------+--------+----------+
| dept_Id | emp_id | emp_name |
+----------+--------+----------+
| 10 | 1001 | John |
| 10 | 1002 | Rich |
+----------+--------+----------+
I want the JSON to be created as follows:
{
  "dept_id": "10",
  "dept_name": "Sales",
  "employee": [
    { "emp_id": "1001", "emp_name": "John" },
    { "emp_id": "1002", "emp_name": "Rich" }
  ]
}
Appreciate your thoughts. Thanks
First join the two dataframes together:
val df = df1.join(df2, Seq("dept_Id"))
Then use groupBy and collect_list. Two case classes are used here to get the correct field names in the final JSON; they should be placed outside of the main method.
case class Department(dept_Id: Int, dept_name: String, employee: Seq[Employee])
case class Employee(emp_id: Int, emp_name: String)
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._ // for the $ column syntax and the .as[Department] encoder

val dfDept = df.groupBy("dept_Id", "dept_name")
  .agg(collect_list(struct($"emp_id", $"emp_name")).as("employee"))
  .as[Department]
Resulting dataframe:
+-------+---------+--------------------------+
|dept_id|dept_name|employee |
+-------+---------+--------------------------+
|10 |Sales |[[1002,Rich], [1001,John]]|
+-------+---------+--------------------------+
Finally, save it as a json file:
dfDept.coalesce(1).write.json("department.json")
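Note that Spark writes a directory named department.json containing the part file(s) rather than a single file. For the sample data, the written output should be one JSON line per department, roughly like this (the order of elements inside the collected employee array is not guaranteed):
{"dept_Id":10,"dept_name":"Sales","employee":[{"emp_id":1001,"emp_name":"John"},{"emp_id":1002,"emp_name":"Rich"}]}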
I have a JSON data feed coming into SQL Server 2016. One of the attributes I must parse contains a JSON array. Unfortunately, instead of implementing a key/value design, the source system sends each member of the array with a different attribute name. The attribute names are not known in advance, and are subject to change/volatility.
declare @json nvarchar(max) =
'{
  "objects": [
    {"foo":"fooValue"},
    {"bar":"barValue"},
    {"baz":"bazValue"}
  ]
}';

select * from openjson(json_query(@json, 'strict $.objects'));
As you can see, element 0 has a "foo" attribute, element 1 has a "bar" attribute, and element 2 has a "baz" attribute:
+-----+--------------------+------+
| key | value | type |
+-----+--------------------+------+
| 0 | {"foo":"fooValue"} | 5 |
| 1 | {"bar":"barValue"} | 5 |
| 2 | {"baz":"bazValue"} | 5 |
+-----+--------------------+------+
Ideally, I would like to parse and project the data like so:
+-----+---------------+----------------+------+
| key | attributeName | attributeValue | type |
+-----+---------------+----------------+------+
| 0 | foo | fooValue | 5 |
| 1 | bar | barValue | 5 |
| 2 | baz | bazValue | 5 |
+-----+---------------+----------------+------+
Reminder: The attribute names are not known in advance, and are subject to change/volatility.
select o.[key], v.[key] as attributeName, v.[value] as attributeValue, v.[type]
from openjson(json_query(@json, 'strict $.objects')) as o
cross apply openjson(o.[value]) as v;
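Against the sample feed, this projects each object's single attribute into name/value columns. Note that the type column now reflects the inner value's type (1 = string in OPENJSON's type encoding) rather than the outer object's type 5:
+-----+---------------+----------------+------+
| key | attributeName | attributeValue | type |
+-----+---------------+----------------+------+
| 0   | foo           | fooValue       | 1    |
| 1   | bar           | barValue       | 1    |
| 2   | baz           | bazValue       | 1    |
+-----+---------------+----------------+------+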
I'm trying to obtain all IDs from table A of the elements which are in a JSON array in table B. My problem is that I don't know how to use wildcard symbols with JSON.
Table A looks like this
+-----------------------+--------------------+
| Identifier (TINYTEXT) | Filter (JSON) |
+-----------------------+--------------------+
| Obj1 | ['Test1', 'Test2'] |
| Obj2 | ['Test3', 'Test4'] |
+-----------------------+--------------------+
and table B looks like this
+-----------+--------------------+
| UID (INT) | Object (TINYTEXT) |
+-----------+--------------------+
| 1 | xyzTest1, abc |
| 2 | xyzTest2, abc |
| 3 | xyzTest3, abc |
| 4 | xyzTest4, abc |
+-----------+--------------------+
I want to use A.Identifier as the input to get B.UID as the output, applying the filter A.Filter on B.Object using wildcard symbols.
Example:
I have A.Identifier = 'Obj1' and want to find all B.UID for the corresponding B.Object that contain Test1 or Test2 (A.Filter). In this case, the output would be 1 and 2.
The SQL code (without the join) that I would manually use for this is:
SELECT UID FROM B WHERE Object LIKE '%Test1%' OR Object LIKE '%Test2%';
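One way to express this as a single query, sketched here assuming MySQL 8.0+ (for JSON_TABLE) and that Filter holds valid JSON with double-quoted strings, e.g. ["Test1", "Test2"]:
SELECT DISTINCT b.UID
FROM A
JOIN JSON_TABLE(A.Filter, '$[*]' COLUMNS (term VARCHAR(64) PATH '$')) AS jt
JOIN B AS b ON b.Object LIKE CONCAT('%', jt.term, '%')
WHERE A.Identifier = 'Obj1';
JSON_TABLE unnests each Filter entry into its own row (the column name term is chosen for this sketch), and the LIKE join then matches each term against B.Object with wildcards on both sides.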
I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have a dataframe called df with three columns: 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The 'weightingRange' column is an array containing the following:
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
so the dataframe looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting)
+-----------+-------------+------------------------+
| companyId | companySize | weightingRange         |
+-----------+-------------+------------------------+
| ABC1      | 150         | [{"maximum":100, etc}] |
| ABC2      | 50          | [{"maximum":100, etc}] |
+-----------+-------------+------------------------+
So for an entry with companySize = 150, I need to return the weight 456 in a column called 'companyWeighting'.
It should show the following:
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1 | 150 | [{"maximum":100, etc}] | 456 |
| ABC2 | 50 | [{"maximum":100, etc}] | 123 |
+-----------+-------------+------------------------+------------------+
I've had a look at
df.withColumn("tmp",explode(col("weightingRange"))).select("tmp.*")
and then joining back, but applying that approach would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this.
First, create a sample dataframe:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0, "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50, [{"min": 0, "max": 100, "weight": 123},
                  {"min": 101, "max": 200, "weight": 456},
                  {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then, create a UDF and apply it to each row to get the new column:
def get_weight(wt, wt_rnge):
    # return the weight of the first range that contains wt
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']

get_weight_udf = F.udf(lambda x, y: get_weight(x, y))

df = df.withColumn('companyWeighting', get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output as,
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
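One caveat: F.udf defaults to a StringType return, so companyWeighting above is actually a string column. If you want to keep it numeric, declare the return type explicitly; a small variant of the same UDF:
import pyspark.sql.types as T

# same function as above, but the UDF now returns an integer column
get_weight_udf = F.udf(get_weight, T.IntegerType())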
I have 4 columns in a Hive table. The first two columns are of type string; the 3rd and 4th contain JSON. How do I extract the JSON data into separate columns?
The SerDes available in Hive seem to handle only pure JSON records, whereas I have both normal (STRING) and JSON data. How can I extract the data into separate columns here?
Example:
abc 2341 {max:2500e0,value:"20",Type:"1",ProviderType:"ABC"} {Name:"ABC",minA:1200e0,StartDate:1483900200000,EndDate:1483986600000,Flags:["flag4","flag3","flag2","flag1"]}
xyz 6789 {max:1300e0,value:"10",Type:"0",ProviderType:"foo"} {Name:"foo",minA:3.14159e0,StartDate:1225864800000,EndDate:1225864800000,Flags:["foo","foo"]}
Given the JSON fixed up to be valid (with quoted keys):
create table mytable (str string,i int,jsn1 string, jsn2 string);
insert into mytable values
('abc',2341,'{"max":2500e0,"value":"20","Type":"1","ProviderType":"ABC"}','{"Name":"ABC","minA":1200e0,"StartDate":1483900200000,"EndDate":1483986600000,"Flags":["flag4","flag3","flag2","flag1"]}')
,('xyz',6789,'{"max":1300e0,"value":"10","Type":"0","ProviderType":"foo"}','{"Name":"foo","minA":3.14159e0,"StartDate":1225864800000,"EndDate":1225864800000,"Flags":["foo","foo"]}')
;
select str,i
,jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
,jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate
,jsn2_Flags
from mytable
lateral view json_tuple (jsn1,'max','value','Type','ProviderType')
j1 as jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
lateral view json_tuple (jsn2,'Name','minA','StartDate','EndDate','Flags')
j2 as jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate,jsn2_Flags
;
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| str | i | jsn1_max | jsn1_value | jsn1_type | jsn1_providertype | jsn2_name | jsn2_mina | jsn2_startdate | jsn2_enddate | jsn2_flags |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| abc | 2341 | 2500.0 | 20 | 1 | ABC | ABC | 1200.0 | 1483900200000 | 1483986600000 | ["flag4","flag3","flag2","flag1"] |
| xyz | 6789 | 1300.0 | 10 | 0 | foo | foo | 3.14159 | 1225864800000 | 1225864800000 | ["foo","foo"] |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
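If you only need a field or two rather than the full projection, get_json_object is a lighter-weight built-in alternative to the lateral views; for example, against the same table:
select str, i,
       get_json_object(jsn1, '$.max')  as jsn1_max,
       get_json_object(jsn2, '$.Name') as jsn2_Name
from mytable;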
I have a hive table - Table A as follows:
id | partner | recent_use | count |
1 | ab | 20160101 | 5 |
1 | cd | 20160304 | 12 |
2 | ab | 20160205 | 1 |
2 | cd | 20150101 | 2 |
3 | ab | 20150401 | 4 |
From Table A, I want to end up with a table like this - Table B:
id | partner |
1 | [ ab : { recent_use:20160101, count:5 } , cd : { recent_use:20160304, count:12 } ]
2 | [ ab : { recent_use:20160205, count:1 } , cd : { recent_use:20150101, count:2 } ]
3 | [ ab : { recent_use:20150401, count:4 } ]
Basically, Table B is a nested version of Table A such that, for a given id, all the data from each of its partners is grouped into one column.
I have two questions:
How can I create Table B from Table A?
How can I convert Table B into a JSON document such that I can load the document into any NOSQL DB?
Would really appreciate any help on this. Thanks!
The simple way to achieve this is with a UDAF (user-defined aggregation function); you can write a custom function to keep things clean. Alternatively, here is something you can do using built-in functions. Give it a try.
select id,
       CONCAT('[',
              concat_ws(',', collect_set(
                  CONCAT('"', partner, '":{ "recent_use":', recent_use, ', "count":', `count`, '}'))),
              ']') as collJ
from tableA
group by id;
The SQL above will get you id and collJ as the string you are looking for; after that, you can use the get_json_object function to work with it as a JSON object.
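Run against Table A, this should produce rows along these lines (collect_set does not guarantee element order):
id | collJ |
1  | ["ab":{ "recent_use":20160101, "count":5},"cd":{ "recent_use":20160304, "count":12}] |
2  | ["ab":{ "recent_use":20160205, "count":1},"cd":{ "recent_use":20150101, "count":2}] |
3  | ["ab":{ "recent_use":20150401, "count":4}] |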
Reference
https://www.qubole.com/resources/cheatsheet/hive-function-cheat-sheet/
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy