How can I filter or select sub-fields of StructType columns in PyArrow?

I'm looking for a way to filter and/or select sub-fields of StructType columns. For example, in this table:
import pyarrow as pa

pylist = [
    {'int': 1, 'str': 'a', 'struct': {'sub': 1, 'sub2': 3}},
    {'int': 2, 'str': 'b', 'struct': {'sub': 2, 'sub2': 3}}
]
my_table = pa.Table.from_pylist(pylist)
my_table["struct"]
I want a way to select struct.sub. Is this possible?
Ideally, I'd like to be able to filter based on values in the sub-field. Something like this:
my_table.filter(pa.compute.equal(my_table.column('struct').field('sub'), 1))

Would flattening the table work for your use case?
>>> my_table.flatten()
pyarrow.Table
int: int64
str: string
struct.sub: int64
struct.sub2: int64
----
int: [[1,2]]
str: [["a","b"]]
struct.sub: [[1,2]]
struct.sub2: [[3,3]]
You can then do something like this:
>>> my_table.flatten()["struct.sub"]
<pyarrow.lib.ChunkedArray object at 0x7fac31ff9b20>
[
  [
    1,
    2
  ]
]
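If you also want the filter, you can apply it to the flattened table; a quick sketch (note that the filtered result keeps the flattened columns rather than the original struct):
import pyarrow.compute as pc
flat = my_table.flatten()
# keep only the rows where struct.sub equals 1
flat.filter(pc.equal(flat["struct.sub"], 1))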

In 7.0.0 you can use the struct_field kernel:
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> pa.__version__
'7.0.0'
>>>
>>> pylist = [
... {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}},
... {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>>
>>> # Select
>>> pc.struct_field(my_table['struct'], [0])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499cb0>
[
  [
    1,
    2
  ]
]
>>> pc.struct_field(my_table['struct'], [1])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499d50>
[
  [
    3,
    3
  ]
]
>>>
>>> # Filter
>>> my_table.filter(pc.equal(pc.struct_field(my_table['struct'], [0]), 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
child 0, sub: int64
child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
  -- is_valid: all not null
  -- child 0 type: int64
[
  1
]
  -- child 1 type: int64
[
  3
]]
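Depending on your PyArrow version, struct_field may also accept field names instead of integer indices (this landed in a release after 7.0.0, so treat it as version-dependent):
# assumption: a newer PyArrow release where struct_field accepts field names
pc.struct_field(my_table['struct'], 'sub')
This avoids having to know the field's position inside the struct.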
In 8.0.0 you can also use the query engine:
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> pa.__version__
'8.0.0.dev477'
>>>
>>> pylist = [
... {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}},
... {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>>
>>> # Select
>>> ds.dataset(my_table).to_table(columns={'sub': ds.field('struct', 'sub')})
pyarrow.Table
sub: int64
----
sub: [[1,2]]
>>>
>>> # Filter
>>> ds.dataset(my_table).to_table(filter=(ds.field('struct', 'sub') == 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
child 0, sub: int64
child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
-- is_valid: all not null
-- child 0 type: int64
[1]
-- child 1 type: int64
[3]]
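In more recent PyArrow releases, Table.filter can also take a compute Expression directly, which avoids the dataset round-trip (this is an assumption about your PyArrow version; older releases only accept a boolean mask):
# assumption: a PyArrow version whose Table.filter accepts Expressions
my_table.filter(pc.field("struct", "sub") == 1)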

Related

Merge multiple values for same key to one dict/json (Pandas, Python, Dataframe)?

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2], 'key': ['a', 'a', 'b', 'a', 'b'], 'value': ['kkk', 'aaa', '5', 'kkk', '8']})
I want to convert it to the following data frame:
id value
1 {'a':['kkk', 'aaa'], 'b': 5}
2 {'a':['kkk'], 'b': 8}
I am trying to do this using the .to_dict method, but the output is:
df.groupby(['id','key']).aggregate(list).groupby('id').aggregate(list)
{'value': {1: [['kkk', 'aaa'], ['5']], 2: [['kkk'], ['8']]}}
Should I perform dict comprehension or there is an efficient logic to build such generic json/dict?
After you groupby(['id', 'key']) and agg(list), you can group by the first level of the index and for each group thereof, use droplevel + to_dict:
new_df = df.groupby(['id', 'key']).agg(list).groupby(level=0).apply(lambda x: x['value'].droplevel(0).to_dict()).reset_index(name='value')
Output:
>>> new_df
id value
0 1 {'a': ['kkk', 'aaa'], 'b': ['5']}
1 2 {'a': ['kkk'], 'b': ['8']}
Or, simpler:
new_df = df.groupby('id').apply(lambda x: x.groupby('key')['value'].agg(list).to_dict())
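Note that this simpler version returns a Series indexed by id rather than the two-column frame above; if you want the same shape, resetting the index works (a small follow-up sketch):
new_df = (
    df.groupby('id')
      .apply(lambda x: x.groupby('key')['value'].agg(list).to_dict())
      .reset_index(name='value')
)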

Opening a json column as a string in pyspark schema and working with it

I have a big dataframe whose schema I cannot infer. It has a column whose values look like JSON, but I don't know their full structure (i.e. the keys and values can vary and I don't know in advance what they will be).
I want to read it as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *

data = [{"ID": 1, "Value": {"a": 12, "b": "test"}},
        {"ID": 2, "Value": {"a": 13, "b": "test2"}}]
df = spark.createDataFrame(data)

# change my schema to open the column as string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)
df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID| Value|
+---+---------------+
| 1| {a=12, b=test}|
| 2|{a=13, b=test2}|
+---+---------------+
As you can see, the values in column Value now have no quotes and use = instead of :, so I can no longer work with them properly.
How can I turn that back into a StructType or MapType?
Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing the braces {} from the string column, like this:
from pyspark.sql import functions as F

df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show()
#+---+---------------------+
#|ID |Value |
#+---+---------------------+
#|1 |{a -> 12, b -> test} |
#|2 |{a -> 13, b -> test2}|
#+---+---------------------+
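If you can intercept the column while it is still a valid JSON string (i.e. before the quotes are lost), from_json with a MapType is an alternative; a sketch under that assumption, using hypothetical sample data:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# assumes the column still holds valid JSON text such as '{"a": 12, "b": "test"}'
df_json = spark.createDataFrame(
    [(1, '{"a": 12, "b": "test"}'), (2, '{"a": 13, "b": "test2"}')],
    ["ID", "Value"]
)
df_parsed = df_json.withColumn(
    "Value", F.from_json("Value", MapType(StringType(), StringType()))
)
df_parsed.printSchema()  # Value is now map<string,string>; numeric values are kept as strings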

Double Nested JSON to DF

I'm unable to make this JSON:
{
  "profiles": {
    "1": {
      "id": "1",
      "property1": "value1",
      "property2": "value2"
    },
    "2": {
      "id": "2",
      "property1": "value21",
      "property2": "value22"
    }
  }
}
To this format (desired output):
Id  Property1  Property2
1   Value1     Value2
2   Value21    Value22
I've attempted different approaches, but they just result in one column containing all the data.
Can someone please orient me on this?
Based on this example:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
I would suggest something like:
import pandas as pd

your_json = {<your_json>}
property1 = []
property2 = []
for key, value in your_json.items():
    for k, v in value.items():
        property1.append(v['property1'])
        property2.append(v['property2'])

data = {'property1': property1, 'property2': property2}
tt = pd.DataFrame.from_dict(data)
print(tt)
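Since each profile dict already looks like a row, pandas can also build the frame directly from the nested dict with orient='index'; a sketch assuming your_json is the parsed dict from the question:
import pandas as pd

# the keys ('1', '2') become the index; each profile dict becomes a row
tt = pd.DataFrame.from_dict(your_json['profiles'], orient='index')
print(tt)  # columns: id, property1, property2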

Spark JSON: reading fields that are optional in the JSON into case classes

Consider the following case class schema,
case class Y (a: String, b: String)
case class X (dummy: String, b: Y)
The field b is optional; some of my data sets don't have it. When I try to read a JSON string that doesn't contain it, I receive a missing-field exception.
spark.read.json(Seq("{'dummy': '1', 'b': {'a': '1'}}").toDS).as[X]
org.apache.spark.sql.AnalysisException: No such struct field b in a;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.findField(complexTypeExtractors.scala:85)
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$resolveExpression$1.applyOrElse(Analyzer.scala:1074)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$resolveExpression$1.applyOrElse(Analyzer.scala:1065)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:282)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
How do I automatically deserialize fields that aren't present in the JSON as null?
Define the b field as an Option type and use Encoders to derive the struct type schema.
Pass that schema via .schema when reading, then convert to a Dataset of the case class X.
Example:
case class Y (a: String, b: Option[String] = None)
case class X (dummy: String, b: Y)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[X].schema
spark.read.schema(schema).json(Seq("{'dummy': '1', 'b': {'a': '1'}}").toDS).as[X].show()
//+-----+----+
//|dummy| b|
//+-----+----+
//| 1|[1,]|
//+-----+----+
Select b column from struct type:
spark.read.schema(schema).json(Seq("{'dummy': '1', 'b': {'a': '1'}}").toDS).as[X].
select("b.b").show()
//+----+
//| b|
//+----+
//|null|
//+----+
PrintSchema:
spark.read.schema(schema).json(Seq("{'dummy': '1', 'b': {'a': '1'}}").toDS).as[X].printSchema
//root
//|-- dummy: string (nullable = true)
//|-- b: struct (nullable = true)
//| |-- a: string (nullable = true)
//| |-- b: string (nullable = true)

pyspark convert row to json with nulls

Goal:
For a dataframe with schema
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post, PySpark - Convert to JSON row by row, and in related questions.
My code:
from pyspark.sql import functions as func
df = df.withColumn("JSON", func.to_json(func.struct([df[x] for x in small_df.columns])))
I am having one issue: when any row has a null value for a column (and my data has many...), the JSON string doesn't contain that key. I.e. if only 9 out of the 27 columns have values, then the JSON string only has 9 keys. What I would like to do is keep all keys, and for the null values just pass an empty string "".
Any tips?
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement the if-then-else logic: use the column's value if it is not null, otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(col(x).isNotNull(), col(x)).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce

# Note: this assumes sdf still has only the original columns (A, B, C),
# i.e. it is run as an alternative to the when-based version above.
sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above
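One caveat with both variants is that every value ends up as a string in the resulting JSON. If you would rather keep nulls as JSON nulls than replace them with empty strings, to_json accepts the ignoreNullFields option in newer Spark releases (Spark 3.0+ is an assumption here):
from pyspark.sql.functions import to_json, struct

# assumes Spark 3.0+, where to_json forwards ignoreNullFields to the JSON writer
sdf.withColumn(
    "JSON",
    to_json(struct("A", "B", "C"), {"ignoreNullFields": "false"})
).show(truncate=False)
# nulls now appear as e.g. "A":null instead of being dropped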