Fill null values only in specific columns of a PySpark DataFrame [duplicate]

I have the following sample DataFrame:
+------+------+------+
|    a |    b |    c |
+------+------+------+
|    1 |    2 |    4 |
|    0 | null | null |
| null |    3 |    4 |
+------+------+------+
And I want to replace null values only in the first 2 columns - Column "a" and "b":
+---+---+------+
| a | b |    c |
+---+---+------+
| 1 | 2 |    4 |
| 0 | 0 | null |
| 0 | 3 |    4 |
+---+---+------+
Here is the code to create the sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
But when I try to restrict the fill to the first two columns, I lose the third column:
df2 = df2.select(df2.columns[0:2]).fillna(0)

fillna takes a subset parameter to restrict the fill to chosen columns (available since Spark 1.3.1):
df.fillna(0, subset=['a', 'b'])

Use a dictionary to fill values of certain columns:
df.fillna({'a': 0, 'b': 0})
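Putting either answer together with the question's setup gives a minimal end-to-end sketch; both forms fill only "a" and "b" and leave "c" untouched:
rdd = sc.parallelize([(1, 2, 4), (0, None, None), (None, 3, 4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])

# fill nulls in columns "a" and "b" only; column "c" keeps its nulls
df2.fillna(0, subset=['a', 'b']).show()
df2.fillna({'a': 0, 'b': 0}).show()

# Both print:
# +---+---+----+
# |  a|  b|   c|
# +---+---+----+
# |  1|  2|   4|
# |  0|  0|null|
# |  0|  3|   4|
# +---+---+----+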

Related

Parse JSON Array where each member has different schema but same general structure

I have a JSON data feed coming into SQL Server 2016. One of the attributes I must parse contains a JSON array. Unfortunately, instead of implementing a key/value design, the source system sends each member of the array with a different attribute name. The attribute names are not known in advance, and are subject to change/volatility.
declare @json nvarchar(max) =
'{
  "objects": [
    {"foo":"fooValue"},
    {"bar":"barValue"},
    {"baz":"bazValue"}
  ]
}';
select * from openjson(json_query(@json, 'strict $.objects'));
As you can see:
element 0 has a "foo" attribute
element 1 has a "bar" attribute
element 2 has a "baz" attribute:
+-----+--------------------+------+
| key | value | type |
+-----+--------------------+------+
| 0 | {"foo":"fooValue"} | 5 |
| 1 | {"bar":"barValue"} | 5 |
| 2 | {"baz":"bazValue"} | 5 |
+-----+--------------------+------+
Ideally, I would like to parse and project the data like so:
+-----+---------------+----------------+------+
| key | attributeName | attributeValue | type |
+-----+---------------+----------------+------+
| 0 | foo | fooValue | 5 |
| 1 | bar | barValue | 5 |
| 2 | baz | bazValue | 5 |
+-----+---------------+----------------+------+
Reminder: The attribute names are not known in advance, and are subject to change/volatility.
select o.[key], v.* --v.[key] as attributeName, v.value as attributeValue
from openjson(json_query(@json, 'strict $.objects')) as o
cross apply openjson(o.[value]) as v;
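To land on exactly the projection asked for, the aliases left in the comment above can replace v.* (same query, just with the columns spelled out):
select o.[key], v.[key] as attributeName, v.value as attributeValue, v.[type]
from openjson(json_query(@json, 'strict $.objects')) as o
cross apply openjson(o.[value]) as v;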

MySQL JSON wildcards

I'm trying to obtain all IDs from table A of the elements which are in a JSON array in table B. My problem is that I don't know how to use wildcard symbols with JSON.
Table A looks like this:
+-----------------------+--------------------+
| Identifier (TINYTEXT) | Filter (JSON) |
+-----------------------+--------------------+
| Obj1 | ['Test1', 'Test2'] |
| Obj2 | ['Test3', 'Test4'] |
+-----------------------+--------------------+
and table B looks like this:
+-----------+--------------------+
| UID (INT) | Object (TINYTEXT) |
+-----------+--------------------+
| 1 | xyzTest1, abc |
| 2 | xyzTest2, abc |
| 3 | xyzTest3, abc |
| 4 | xyzTest4, abc |
+-----------+--------------------+
I want to use A.Identifier as the input to get B.UID as the output, applying the filter A.Filter on B.Object using wildcard symbols.
Example:
I have A.Identifier = 'Obj1' and want to find all B.UID for the corresponding B.Object that contain Test1 or Test2 (A.Filter). In this case, the output would be 1 and 2.
Without the inner join, the SQL I would write manually for this is:
SELECT UID FROM B WHERE Object LIKE '%Test1%' OR Object LIKE '%Test2%';
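No answer was recorded for this one, but here is a hedged sketch of one approach, assuming MySQL 8.0+ (for JSON_TABLE) and that A.Filter holds a valid JSON array of strings, i.e. ["Test1", "Test2"] with double quotes rather than the single quotes shown above:
-- expand the JSON array into one row per filter term,
-- then apply each term as a LIKE wildcard against B.Object
SELECT DISTINCT B.UID
FROM A
CROSS JOIN JSON_TABLE(A.Filter, '$[*]'
    COLUMNS (term VARCHAR(255) PATH '$')) AS jt
JOIN B ON B.Object LIKE CONCAT('%', jt.term, '%')
WHERE A.Identifier = 'Obj1';
For A.Identifier = 'Obj1' this should return UIDs 1 and 2, matching the example.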

Pyspark - getting values from an array that has a range of min and max values

I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have a dataframe called df with three columns: 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The 'weightingRange' column is an array containing the following:
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
so the dataframe looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting):
+-----------+-------------+------------------------+
| companyId | companySize | weightingRange         |
+-----------+-------------+------------------------+
| ABC1      | 150         | [{"maximum":100, etc}] |
| ABC2      | 50          | [{"maximum":100, etc}] |
+-----------+-------------+------------------------+
So for an entry with companySize = 150, I need to return the weight 456 into a column called 'companyWeighting'.
It should show the following:
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1 | 150 | [{"maximum":100, etc}] | 456 |
| ABC2 | 50 | [{"maximum":100, etc}] | 123 |
+-----------+-------------+------------------------+------------------+
I've had a look at
df.withColumn("tmp", explode(col("weightingRange"))).select("tmp.*")
and then joining back, but applying that would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this. First, create a sample dataframe:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0, "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50, [{"min": 0, "max": 100, "weight": 123},
                  {"min": 101, "max": 200, "weight": 456},
                  {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then define a UDF and apply it to each row to derive the new column:
def get_weight(wt, wt_rnge):
    # return the weight of the first range whose bounds contain wt
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']

get_weight_udf = F.udf(lambda x, y: get_weight(x, y))
df = df.withColumn('companyWeighting',
                   get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output:
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
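As a possible UDF-free alternative, here is a sketch assuming Spark 2.4+ (for higher-order functions) and the same 'min'/'max'/'weight' keys as in the sample above:
# keep only the range(s) whose bounds contain companySize,
# then take the weight of the first match
df = df.withColumn(
    'companyWeighting',
    F.expr("filter(weightingRange, r -> r['min'] <= companySize and companySize <= r['max'])[0]['weight']")
)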

MySQL flexible conversion to numeric data

I have a MySQL database with several categorical columns. When searching the database, it would be nice to have a conversion that maps the categorical data, spread across multiple columns, to a single numeric variable I could use for sorting.
Ideally this conversion would be a function, not just another data table, since the mapping itself may change. It could probably be as simple as the following pseudocode, but I'm not sure what the best way to do something like this in SQL would be. Thanks in advance.
a = 0
if b == "val1" { a += 1 }
if c == 2 { a += 2 }
if c == 1 { a += 1 }
return a
Here a is the numeric result and b and c are the values I'm mapping to a. The same example in table form, with everything joined in case these columns live in different tables:
+---+------+------+
| a | b | c |
+---+------+------+
| 0 | 3 | xxxx |
| 1 | 1 | xxxx |
| 2 | 2 | xxxx |
| 1 | 3 | val1 |
| 3 | 2 | val1 |
| 2 | 1 | val1 |
+---+------+------+
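No answer was recorded for this one either. A hedged sketch of one way in MySQL is to wrap the mapping in a stored function, so the mapping can change without touching any table. All names here are illustrative, and the column roles follow the example table above (b numeric, c holding 'val1'); note the pseudocode and the table swap the roles of b and c.
DELIMITER //
CREATE FUNCTION category_score(b INT, c TEXT)
RETURNS INT DETERMINISTIC
BEGIN
    DECLARE a INT DEFAULT 0;
    IF c = 'val1' THEN SET a = a + 1; END IF;
    IF b = 2 THEN SET a = a + 2; END IF;
    IF b = 1 THEN SET a = a + 1; END IF;
    RETURN a;
END //
DELIMITER ;

-- then sort by the computed score, e.g.:
-- SELECT t.*, category_score(t.b, t.c) AS a FROM some_table t ORDER BY a;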

Convert JSON to separate columns in Hive

I have 4 columns in a Hive database table. The first two columns are of type string; the 3rd and 4th hold JSON. How do I extract the JSON data into separate columns?
The JSON SerDe available in Hive seems to handle only records that are entirely JSON, but I have both normal (STRING) and JSON columns. How can I extract the data into separate columns here?
Example:
abc 2341 {max:2500e0,value:"20",Type:"1",ProviderType:"ABC"} {Name:"ABC",minA:1200e0,StartDate:1483900200000,EndDate:1483986600000,Flags:["flag4","flag3","flag2","flag1"]}
xyz 6789 {max:1300e0,value:"10",Type:"0",ProviderType:"foo"} {Name:"foo",minA:3.14159e0,StartDate:1225864800000,EndDate:1225864800000,Flags:["foo","foo"]}
Given the JSON fixed to be valid (keys quoted):
create table mytable (str string,i int,jsn1 string, jsn2 string);
insert into mytable values
('abc',2341,'{"max":2500e0,"value":"20","Type":"1","ProviderType":"ABC"}','{"Name":"ABC","minA":1200e0,"StartDate":1483900200000,"EndDate":1483986600000,"Flags":["flag4","flag3","flag2","flag1"]}')
,('xyz',6789,'{"max":1300e0,"value":"10","Type":"0","ProviderType":"foo"}','{"Name":"foo","minA":3.14159e0,"StartDate":1225864800000,"EndDate":1225864800000,"Flags":["foo","foo"]}')
;
select str,i
,jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
,jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate
,jsn2_Flags
from mytable
lateral view json_tuple (jsn1,'max','value','Type','ProviderType')
j1 as jsn1_max,jsn1_value,jsn1_type,jsn1_ProviderType
lateral view json_tuple (jsn2,'Name','minA','StartDate','EndDate','Flags')
j2 as jsn2_Name,jsn2_minA,jsn2_StartDate,jsn2_EndDate,jsn2_Flags
;
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| str | i | jsn1_max | jsn1_value | jsn1_type | jsn1_providertype | jsn2_name | jsn2_mina | jsn2_startdate | jsn2_enddate | jsn2_flags |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
| abc | 2341 | 2500.0 | 20 | 1 | ABC | ABC | 1200.0 | 1483900200000 | 1483986600000 | ["flag4","flag3","flag2","flag1"] |
| xyz | 6789 | 1300.0 | 10 | 0 | foo | foo | 3.14159 | 1225864800000 | 1225864800000 | ["foo","foo"] |
+-----+------+----------+------------+-----------+-------------------+-----------+-----------+----------------+---------------+-----------------------------------+
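One caveat: json_tuple returns jsn2_flags as a raw JSON string, as the last column above shows. If individual flags are needed, get_json_object supports array subscripts in its path (a small sketch against the same table):
select get_json_object(jsn2, '$.Flags[0]') as first_flag
from mytable;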