Can I select all matching fields of a JSON document regardless of their path?

I have a lot of JSON documents stored in a json data type column in Postgres. There are plenty of places where the key "warning" can appear. Unfortunately I cannot get a JSON schema, so I cannot know in advance exactly where all the warning keys may show up. So I would like to do something like this:
select report #> '{*,warning}' from foo;
Is there some way to use wildcards in paths? Or is the only way to dynamically traverse a JSON value, say key by key, recursively in a PL/pgSQL function (if it is even possible to return a set of cursors as one big cursor)?
EDIT:
Interestingly, the good old xml data type can do exactly what I need, so I am a bit puzzled why we cannot do the same operations on JSON documents, like:
select xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>');
select * from xmltable('//town' PASSING BY REF '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>' COLUMNS town varchar PATH 'text()');
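For what it's worth, a recursive wildcard does exist on the jsonb side: PostgreSQL 12+ supports SQL/JSON path expressions, whose .** accessor descends to any depth. A minimal sketch, assuming the report column can be cast to jsonb:

select jsonb_path_query(report::jsonb, '$.**.warning') as warning
from foo;

On older versions, a recursive CTE over jsonb_each can emulate the traversal; this sketch descends through objects only (arrays would need an analogous jsonb_array_elements branch):

with recursive walk(v) as (
    select report::jsonb from foo
  union all
    -- expand one level: enumerate the children of every object seen so far
    select e.value
    from walk
    cross join lateral jsonb_each(
        case when jsonb_typeof(walk.v) = 'object'
             then walk.v else '{}'::jsonb end) as e
)
select v -> 'warning' as warning
from walk
where jsonb_typeof(v) = 'object' and v ? 'warning';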

Related

Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from JSON data with input format org.apache.hadoop.mapred.TextInputFormat, output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, and SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the JSON is a highly nested JSON object called groups. The nested data doesn't follow a strict schema, so not all JSON objects within groups have the same attributes. I'm having trouble accessing the attributes of groups, and I suspect that I am not typing groups correctly.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of typing groups when creating the external table. I tried the SUPER type: when I select groups I get the entire JSON, but when I select an attribute of groups, such as select groups.sellersAuths from ... or select groups."sellersAuths" from ..., I get relation "groups" does not exist.
I've also tried declaring it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to type groups when creating the external table, and I'm not sure if my use case is what the SUPER type is for.
I've also tried using JSON_PARSE after casting the data to VARCHAR, or to SUPER, or to a struct, but that presents issues as well.
Thanks a ton for reading!
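A hedged sketch of one possible DDL matching the sample rows, modeling the dynamic inner keys (USAmazon, SWIPE, ...) as maps rather than structs; every schema, table, column length, and location name here is hypothetical:

create external table spectrum.auths (
    entity varchar(32),
    "date" varchar(32),
    dataset varchar(32),
    aggregations struct<sellersAuths:int, sellersDeAuths:int>,
    groups struct<
        sellersAuths:struct<
            mws_region:map<varchar(64),int>,
            created_by:map<varchar(64),int>,
            last_updated_by:map<varchar(64),int>>,
        sellersDeAuths:struct<
            mws_region:map<varchar(64),int>,
            created_by:map<varchar(64),int>,
            last_updated_by:map<varchar(64),int>>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile
location 's3://my-bucket/authorizations/';

Note that relation "groups" does not exist is usually a qualification problem: without a table alias, Redshift parses groups.sellersAuths as schema.table. With an alias, nested structs are reached by dot notation, and map entries are unnested in the FROM clause, roughly:

select a.entity, m.key, m.value
from spectrum.auths a, a.groups.sellersAuths.mws_region m;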

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the JSON (content).
It is probably something related to how to query an array field in Spectrum.
Any idea?
You have to alias the table to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data. I really don't know why it needs the alias, but it works. If you need to access elements of an array, you have to explode it like:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members AS member;
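For instance, assuming the elements of members carry a hypothetical name attribute, the unnested query would look like:

SELECT a.content.my_string, member.name
FROM schema.tableA a, a.content.members AS member;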
You need to create a Glue Classifier.
Select JSON as the Classifier type, and for the JSON Path input the following:
$[*]
then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for, but I figured I'd drop it here in case others hit the same problem I had.

How do I convert a column of JSON strings into a parquet table

I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but I feel like I am missing a step.
I receive files that are CSVs in the format "id", "event", "source", where the "event" column is a GZIP-compressed JSON string. I've been able to set up a dataframe that extracts the three columns, including unzipping the JSON string. So I now have a table with
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out the schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single-column table with a _corrupt_record column that contains the JSON string again (presumably because the RDD holds Row objects rather than raw JSON strings).
What I'm trying to get to is to take schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema, or have I oversimplified my approach?
A potential complication here: the JSON schema is actually more involved than the above. I've been assuming I can expand the full schema into a wider table and then just return the smaller set of columns I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file, and reading from it. Doing so works, i.e., spark.read.json(myJSON.json) parses the string into the arrays I was expecting. This is also true if I copy multiple strings.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a JSON file
from pyspark.sql.functions import col

dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
and then read them back out, it doesn't behave the same way: each row is still treated as a string.
I did find one solution that works. It is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
from pyspark.sql.functions import get_json_object

dfResults = df.select(
    get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
    get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
    get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
    get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I could also use from_json(), but that requires defining the schema within the notebook. This isn't a great option, because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema is, so this becomes a maintenance issue.)
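For what it's worth, from_json() does not actually require the entire schema: JSON fields left out of the schema are simply ignored. A minimal sketch in Spark SQL, assuming a reasonably recent Spark and that the dataframe has been registered as a hypothetical temp view named events:

SELECT from_json(unencoded_event, 'agent array<struct<name:string, organization:string>>, recorded string') AS parsed
FROM events;

After that, parsed.agent[0].name and parsed.recorded behave like ordinary nested columns, so only the slice of the schema you care about has to be maintained.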

Get JSON keys of a Postgres column containing a specific word

I'm trying to select keys and their values from a json column in Postgres, where the keys end with "_alert".
So in my db I have a column named data, of type json, and I just want the keys ending with "_alert", like "ram_alert", "temperatures_alert", "disk_alert", "cpu_alert".
I need both the key and the value, to compare against the data I have in my backend app and decide whether I need to update the value or not.
How to do this?
I can get all the keys by doing select json_object_keys(data) from devices, but how do I get the key/value pairs? Is there a way to use the "like" expression here?
First off, note that your current query only works while every data value is a JSON object. Try inserting a row whose data is a top-level array and you'll get:
ERROR: cannot call json_object_keys on an array
If you're certain that the column will only ever hold JSON objects, then the following query should give you what you want:
SELECT key,value FROM devices,json_each(devices.data) where key ~ '_alert$';
I'd still guard the query (or constrain the column) against non-object values to be safe.
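Since the question specifically asked about LIKE: the same filter works with it too, as long as the underscore is escaped (in a LIKE pattern, _ matches any single character):

SELECT key, value FROM devices, json_each(devices.data) WHERE key LIKE '%\_alert';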

How to get the type of an entry in a GEOMETRY data type column

A column of the data type GEOMETRY can hold POLYGON, LINESTRING, or POINT data. Is it possible to get the type of the contained geo data, or, for example, to select only rows of a specific type?
Sure, one could use AsText() and a regex to grab the part before the brackets, but that would perform very poorly... Or the data type could be saved in a separate column... But isn't there a built-in function I might just have missed in the MySQL docs?
It looks like you're looking for ST_GeometryType.
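A minimal usage sketch, assuming a hypothetical table places with a GEOMETRY column geom; note that MySQL returns bare type names such as 'POINT':

SELECT id, ST_GeometryType(geom) AS geo_type FROM places;
SELECT * FROM places WHERE ST_GeometryType(geom) = 'POINT';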