How to get map keys from Arrow dataset - pyarrow

What is the recommended approach to obtain a unique list of map keys from an Arrow dataset?
For a dataset with schema containing:
...
PARQUET:field_id: '19'
detail: map<string, struct<reported: bool, incidents_per_month: int32>>
...
Sample data:
"detail": {"a": {"reported": true, "incidents_per_month: 3}, "b": {"reported": true, "incidents_per_month: 3}},
"detail": {"c": {"reported": false, "incidents_per_month: 3}}
What is the right approach to obtaining a list of unique map keys for field detail? i.e. a, b, c
Current (slow) approach:
map_data = dataset.field('detail')
map_keys = list(set([key for chunk in map_data.iterchunks() for key in chunk.keys.unique().tolist()]))

You already found the .keys attribute of a MapArray. This gives an array of all keys, of which you can take the unique values.
But a dataset (Table) can consist of many chunks, and then accessing the data of a column gives a ChunkedArray which doesn't have this keys attribute. For that reason, you loop over the different chunks, and combine the unique values of all of those.
For now, looping over the chunks is still needed I think, but calculating the overall uniques can be done a bit more efficiently with pyarrow:
import pyarrow as pa

# set up a small example
map_type = pa.map_(pa.string(), pa.struct([('reported', pa.bool_()), ('incidents_per_month', pa.int32())]))
values = [
[("a", {"reported": True, "incidents_per_month": 3}), ("b", {"reported": True, "incidents_per_month": 3})],
[("c", {"reported": False, "incidents_per_month": 3})]
]
dataset = pa.table({'detail': pa.array(values, map_type)})
# then creating a chunked array of keys
map_data = dataset.column('detail')
keys = pa.chunked_array([chunk.keys for chunk in map_data.iterchunks()])
# and taking the unique of those in one go:
>>> keys.unique()
<pyarrow.lib.StringArray object at 0x7fbc578af940>
[
"a",
"b",
"c"
]
For optimal efficiency, it would still be good to avoid the python loop of pa.chunked_array([chunk.keys for chunk in map_data.iterchunks()]), and for this I opened https://issues.apache.org/jira/browse/ARROW-12564 to track this enhancement.
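If you need this in several places, the steps above can be wrapped in a small helper. This is only a convenience sketch composing the code already shown; the helper name unique_map_keys is made up here:

import pyarrow as pa

def unique_map_keys(table: pa.Table, column: str) -> list:
    """Collect the unique map keys of a map-typed column as a Python list."""
    map_data = table.column(column)
    keys = pa.chunked_array([chunk.keys for chunk in map_data.iterchunks()])
    return keys.unique().to_pylist()

# unique_map_keys(dataset, 'detail') -> ['a', 'b', 'c']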

Related

Postgresql - Renaming the key of all objects in an array of a JSON column

PostgreSQL 13
The goal is to rename all src keys in the photos array to image.
I have a table plans which has a column json with a simplified structure similar to the below sample.
{
  "id": "some id",
  "name": "some name",
  "tags": [
    {
      "keyId": 123,
      "valueId": 123
    },
    {
      "keyId": 123,
      "valueId": 123
    }
  ],
  "score": 123,
  "photos": [
    {
      "src": "someString"
    },
    {
      "src": "someString"
    }
  ],
  "payment": true
}
The number of objects in the photos array varies, but in general, it is less than 10, so a non-iterating method would be fine, too.
I tried something like this, but it is only good for modifying the value of a key, not the name of the key itself.
UPDATE
plans
SET
json = jsonb_set(json::jsonb, '{photos, 0, src}', '"image"')
;
With the following attempt, I was actually able to rename the key but it overwrites everything else, so only an object with {"image": "someUrl"} is left:
UPDATE
plans
SET
json = (json -> 'photos' ->> 0)::jsonb - 'src' || jsonb_build_object ('image',
json::jsonb -> 'photos' -> 0 -> 'src')
WHERE json::jsonb ? 'photos' = true;
Is there a way to rename keys as expected?
So in the end I used a variation of my initial jsonb_set method. The solution isn't elegant or efficient, but since it is a one-time operation, all that mattered was that it worked:
UPDATE
plans
SET
json = jsonb_set(json::jsonb, '{photos, 0, imageUrl}', (json->'photos'->0->'src')::jsonb)
WHERE
json->'photos'->0->'src' IS NOT NULL
;
This query adds the imageUrl key with the existing value of the src key for the first object (position 0) in the photos array, so it left me with both the src and imageUrl keys.
To remove the src key, I ran the following query:
UPDATE
plans
SET
json = json::jsonb #- '{ photos, 0, src}'
;
Repeating this as many times as the maximum number of elements in a photos array eventually solved the issue for me.
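Since the same pair of statements has to be issued once per array position, it can help to generate them with a small script rather than writing them by hand. A minimal Python sketch, assuming a chosen maximum of 10 photos per array; the table and column names match the question, and executing the generated statements is left to whatever client you use:

MAX_PHOTOS = 10  # assumption: no photos array has more elements than this

statements = []
for i in range(MAX_PHOTOS):
    # copy photos[i].src into photos[i].imageUrl, skipping rows where it is absent
    statements.append(
        f"UPDATE plans "
        f"SET json = jsonb_set(json::jsonb, '{{photos, {i}, imageUrl}}', (json->'photos'->{i}->'src')::jsonb) "
        f"WHERE json->'photos'->{i}->'src' IS NOT NULL;"
    )
    # then drop the old photos[i].src key
    statements.append(f"UPDATE plans SET json = json::jsonb #- '{{photos, {i}, src}}';")

print("\n".join(statements))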

Update JSON Array in Postgres with specific key

I have a complex array which looks like the following in a table column:
{
  "sometag": {},
  "where": [
    {
      "id": "Krishna",
      "nick": "KK",
      "values": [
        "0"
      ],
      "function": "ADD",
      "numValue": [
        "0"
      ]
    },
    {
      "id": "Krishna1",
      "nick": "KK1",
      "values": [
        "0"
      ],
      "function": "SUB",
      "numValue": [
        "0"
      ]
    }
  ],
  "anotherTag": [],
  "TagTag": {
    "tt": "tttttt",
    "tt1": "tttttt"
  }
}
In this array, I want to update the function and numValue of id: "Krishna".
Kindly help.
This is really nasty because:
Updating an element inside a JSON array always requires expanding the array
On top of that: the array is nested
The identifier for the elements to update is a sibling, not a parent, which means you have to filter by a sibling
So I came up with a solution, but I want to add a disclaimer: you should avoid doing this as a regular database action! Better would be:
Parsing your JSON in the backend and doing the operations in your backend code
Normalizing the JSON in your database if this will be a common task, meaning: create tables with appropriate columns and extract your JSON into that table structure. Do not store entire JSON objects in the database! That would make every single task much easier and incredibly more performant!
demo:db<>fiddle
SELECT
    jsonb_set(                                                     -- 5
        (SELECT mydata::jsonb FROM mytable),
        '{where}',
        updated_array
    )::json
FROM (
    SELECT
        jsonb_agg(                                                 -- 4
            CASE WHEN array_elem ->> 'id' = 'Krishna' THEN
                jsonb_set(                                         -- 3
                    jsonb_set(array_elem.value::jsonb, '{function}', '"ADDITION"'::jsonb),  -- 2
                    '{numValue}',
                    '["0","1"]'::jsonb
                )
            ELSE array_elem::jsonb END
        ) as updated_array
    FROM mytable,
        json_array_elements(mydata -> 'where') array_elem          -- 1
) s
1. Extract the nested array elements into one element per row
2. Replace the function value. Note the casts from type json to type jsonb. That is necessary because there is no json_set() function, only jsonb_set(). Naturally, if you just have type jsonb, the casts are not necessary.
3. Replace the numValue value
4. Reaggregate the array
5. Replace the where value of the original JSON object with the newly created array object.

How can I load the following JSON (deeply nested) to a DataFrame?

A sample of the JSON is as shown below:
{
  "AN": {
    "dates": {
      "2020-03-26": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      }
    }
  },
  "KA": {
    "dates": {
      "2020-03-09": {
        "delta": {
          "confirmed": 1
        },
        "total": {
          "confirmed": 1
        }
      },
      "2020-03-10": {
        "delta": {
          "confirmed": 3
        },
        "total": {
          "confirmed": 4
        }
      }
    }
  }
}
I would like to load it into a DataFrame, such that the state names (AN, KA) are represented as Row names, and the dates and nested entries are present as Columns.
Any tips to achieve this would be very much appreciated. [I am aware of json_normalize, however I haven't figured out how to work it out yet.]
The output I am expecting is roughly as shown below:
Can you update your post with the DataFrame you have in mind? It'll be easier to understand what you want.
Also, sometimes it's better to reshape your data if you can't make it work the way it is now.
Update:
Following your update, here's what you can do.
You need to reshape your data; as I said, when you can't achieve what you want, it is best to look at the problem from another point of view. For instance (and from the sample you shared), the 'dates' key is meaningless, as the other keys are already dates and there are no other keys at the same level.
A way to achieve what you want would be to use a MultiIndex; it'll help you group your data the way you want. To use it you can, for instance, create all the indices you need and store the associated values in a dictionary.
Example :
If the only index you have is ('2020-03-26', 'delta', 'confirmed') you should have values = {'AN' : [1], 'KA':None}
Then you only need to create your DataFrame and transpose it.
I gave it a quick try and came up with a piece of code that should work. If you're looking for performance I don't think this will do the trick.
import pandas as pd
from copy import deepcopy

# d is the sample you shared
index = [[], [], []]
values = {}
# Get all the dates
dates = [date for c in d.keys() for date in d[c]['dates'].keys()]
for country in d.keys():
    # For each country we create an array containing all 6 values for each date
    # (missing values as None)
    values[country] = []
    for date in dates:
        if date in d[country]['dates']:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    # Incrementing indices
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    if step in d[country]['dates'][date][method].keys():
                        values[country].append(deepcopy(d[country]['dates'][date][method][step]))
                    else:
                        values[country].append(None)
        # When country does not have a date fill with None
        else:
            for method in ['delta', 'total']:
                for step in ['confirmed', 'recovered', 'tested']:
                    index[0].append(date)
                    index[1].append(method)
                    index[2].append(step)
                    values[country].append(None)

# Removing duplicates introduced because we added n_countries times the indices
# 3 is the number of steps, 2 is the number of methods
number_of_rows = 3 * 2 * len(dates)
index[0] = index[0][:number_of_rows]
index[1] = index[1][:number_of_rows]
index[2] = index[2][:number_of_rows]

df = pd.DataFrame(values, index=index).T
Here is what I have for the transposed DataFrame of my output:
Hope this can help you
You clearly need to reshape your JSON data before loading it into a DataFrame.
Have you tried loading your JSON as a dict?
dataframe = pd.DataFrame.from_dict(JsonDict, orient="index")
The “orient” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
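Building on that, here is a minimal sketch (an illustration added here, not from either answer) that reshapes the parsed sample dict d into the layout asked for: states as rows and (date, delta/total, confirmed) tuples as MultiIndex columns.

import pandas as pd

# d is the nested dict parsed from the JSON in the question
flat = {
    state: {
        (date, kind, metric): value
        for date, kinds in info['dates'].items()
        for kind, metrics in kinds.items()
        for metric, value in metrics.items()
    }
    for state, info in d.items()
}

df = pd.DataFrame.from_dict(flat, orient='index')
df.columns = pd.MultiIndex.from_tuples(df.columns)
# Rows are the states (AN, KA); dates missing for a state simply show up as NaN.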

jq: Turn an array of objects into individual objects and use each array index as a new key

I have several large json objects (think GB scale), where the object values in some of the innermost levels are arrays of objects. I'm using jq 1.4 and I'm trying to break these arrays into individual objects, each of which will have a key such as g__0 or g__1, where the numbers correspond to the index in the original array, as returned by the keys function. The number of objects in each array may be arbitrarily large (in my example it is equal to 3). At the same time I want to keep the remaining structure.
For what it's worth the original structure comes from MongoDB, but I am unable to change it at this level. I will then use this json file to create a schema for BigQuery, where an example column will be seeds.g__1.guid and so on.
What I have:
{
  "port": 4500,
  "notes": "This is an example",
  "seeds": [
    {
      "seed": 12,
      "guid": "eaf612"
    },
    {
      "seed": 23,
      "guid": "bea143"
    },
    {
      "seed": 38,
      "guid": "efk311"
    }
  ]
}
What I am hoping to achieve:
{
  "port": 4500,
  "notes": "This is an example",
  "seeds": {
    "g__0": {
      "seed": 12,
      "guid": "eaf612"
    },
    "g__1": {
      "seed": 23,
      "guid": "bea143"
    },
    "g__2": {
      "seed": 38,
      "guid": "efk311"
    }
  }
}
Thanks!
The following jq program should do the trick. At least it produces the desired results for the given JSON. The program is so short and straightforward that I'll let it speak for itself:
def array2object(prefix):
  . as $in
  | reduce range(0;length) as $i ({}; .["\(prefix)\($i)"] = $in[$i]);

.seeds |= array2object("g__")
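For reference only (this is not part of the jq answer), the same reshaping can be sketched in Python, which is handy for sanity-checking the expected output; the input.json file name is just a placeholder:

import json

def array_to_object(items, prefix="g__"):
    """Turn a list into a dict keyed by '<prefix><index>'."""
    return {f"{prefix}{i}": item for i, item in enumerate(items)}

with open("input.json") as f:  # placeholder file name
    doc = json.load(f)

doc["seeds"] = array_to_object(doc["seeds"])
print(json.dumps(doc, indent=2))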
So, you essentially want to transpose (pivot) your data in the BigQuery table, such that instead of having the data in rows you will have it in columns.
Thus, my recommendation would be:
First, load your data as is to start with.
Then, instead of doing the schema transformation outside of BigQuery, let's rather do it within BigQuery!
Below is an example of how to achieve the transformation you are looking for (assuming you have at most three items/objects in the array):
#standardSQL
SELECT
  port, notes,
  STRUCT(
    seeds[SAFE_OFFSET(0)] AS g__0,
    seeds[SAFE_OFFSET(1)] AS g__1,
    seeds[SAFE_OFFSET(2)] AS g__2
  ) AS seeds
FROM yourTable
You can test this with dummy data using a CTE like below:
#standardSQL
WITH yourTable AS (
  SELECT
    4500 AS port, 'This is an example' AS notes,
    [STRUCT<seed INT64, guid STRING>
      (12, 'eaf612'), (23, 'bea143'), (38, 'efk311')
    ] AS seeds
  UNION ALL SELECT
    4501 AS port, 'This is an example 2' AS notes,
    [STRUCT<seed INT64, guid STRING>
      (42, 'eaf412'), (53, 'bea153')
    ] AS seeds
)
SELECT
  port, notes,
  STRUCT(
    seeds[SAFE_OFFSET(0)] AS g__0,
    seeds[SAFE_OFFSET(1)] AS g__1,
    seeds[SAFE_OFFSET(2)] AS g__2
  ) AS seeds
FROM yourTable
So, technically, if you know the maximum number of items/objects in the seeds array, you can just manually write the needed SQL statement and run it against your real data.
Hope you got the idea.
Of course, you can script/automate the process; you can find examples of similar pivoting tasks here:
https://stackoverflow.com/a/40766540/5221944
https://stackoverflow.com/a/42287566/5221944

How to add nested json object to Lucene Index

I need a little help regarding Lucene index files; I thought maybe some of you guys can help me out.
I have JSON like this:
[
  {
    "Id": 4476,
    "UrlName": null,
    "PhoneData": [
      {
        "PhoneType": "O",
        "PhoneNumber": "0065898"
      },
      {
        "PhoneType": "F",
        "PhoneNumber": "0065898"
      }
    ],
    "Contact": [],
    "Services": [
      {
        "ServiceId": 10,
        "ServiceGroup": 2
      },
      {
        "ServiceId": 20,
        "ServiceGroup": 1
      }
    ]
  }
]
Adding the first two fields is relatively easy:
// add lucene fields mapped to db fields
doc.Add(new Field("Id", sampleData.Id.Value.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("UrlName", sampleData.UrlName.Value ?? "null" , Field.Store.YES, Field.Index.ANALYZED));
But how can I add PhoneData and Services to the index so they can be connected to the unique Id?
For indexing JSON objects I would go this way:
Store the whole value under a payload field, named for example $json. This field would be stored but not indexed.
For each (indexable) property (maybe nested), create an indexable field whose name is an XPath-like expression identifying the property, for example PhoneData.PhoneType
If it is OK that all nested properties will be indexed, then it's simple: just iterate over all of them, generating these indexable fields (a small sketch of this flattening is given at the end of this answer).
But if you don't want to index all of them (a more realistic case), how to know which property is indexable is another problem; in this case you could:
Accept from the client the path expressions of the index fields to be created when storing the document, or
Put JSON Schema into play to describe your data (assuming your JSON records have a common schema), and extend it with a custom property that would allow you to tag which properties are indexable.
I have created a library that does this (and much more) which may help you.
You can check it at https://github.com/brutusin/flea-db
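To make the per-property field idea above more concrete, here is a minimal Python sketch of only the flattening step; it is illustrative (the question's code is C#/Lucene.NET), and the function name and the dot separator are assumptions:

def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested JSON-like structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, prefix)
    else:
        yield prefix.rstrip("."), obj

record = {
    "Id": 4476,
    "PhoneData": [{"PhoneType": "O", "PhoneNumber": "0065898"}],
    "Services": [{"ServiceId": 10, "ServiceGroup": 2}],
}

for path, value in flatten(record):
    # Each pair would become one indexable field on the same Lucene document,
    # e.g. ("PhoneData.PhoneType", "O"), so it stays connected to Id 4476.
    print(path, value)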