I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the JSON (content).
It is probably related to how a query against an array field has to be written in Spectrum.
Any idea?
You have to give the table an alias to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data. I really don't know why it needs this alias, but it works. If you need to access the elements of an array, you have to explode it like this:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members as member;
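If it helps to picture what that comma join over the array does, here is a rough Python analogy. It is only a sketch: the sample rows and the name field are made up, and it assumes (as I understand Spectrum's behaviour) that the comma form acts like an inner join, so rows with an empty members array drop out.

```python
import json

# Hypothetical rows shaped like the question's data: a top-level "content"
# struct with scalar fields plus a "members" array of structs.
rows = [
    json.loads('{"content": {"my_string": "a", "members": [{"name": "x"}, {"name": "y"}]}}'),
    json.loads('{"content": {"my_string": "b", "members": []}}'),
]

def unnest_members(rows):
    """Yield one (my_string, member) pair per array element."""
    for a in rows:                              # FROM schema.tableA a
        for member in a["content"]["members"]:  # , a.content.members AS member
            yield a["content"]["my_string"], member

# One output row per array element; the row with no members yields nothing.
flattened = list(unnest_members(rows))
```

The outer loop plays the role of the table alias and the inner loop plays the role of the member alias, which is why both aliases are needed in the SQL.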
You need to create a Glue Classifier.
Select JSON as Classifier type
and for the JSON Path input the following:
$[*]
then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for but figured I'd drop this here just in case others had the same problem I had.
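To illustrate what the $[*] path is selecting, here is a small Python sketch of the same idea: treat every element of a top-level array as its own record and collect the field names. The sample file content is invented, and this only mimics the idea, not how the Glue crawler is actually implemented.

```python
import json

# Hypothetical file content: one top-level JSON array wrapping every record.
raw = '[{"id": 1, "members": ["a", "b"]}, {"id": 2, "members": []}]'

def split_top_level_array(text):
    """Return each element of a top-level array as a separate record."""
    doc = json.loads(text)
    if not isinstance(doc, list):
        return [doc]  # already a single record, nothing to split
    return doc

records = split_top_level_array(raw)

# The inferred "columns" are the keys seen across the individual records,
# rather than a single column holding the whole array.
columns = sorted({key for record in records for key in record})
```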
Related
I’m creating an external table from json data with input format org.apache.hadoop.mapred.TextInputFormat and output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the json is a highly nested json called groups. The nested data doesn't follow a strict schema, so not all json within groups have the same attributes. I'm having trouble accessing the attributes of groups and I suspect that I am not casting groups to the proper datatype.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried using super type and when I select for groups I get the entire json, but when I select for an attribute of groups such as select groups.sellersAuths from ... or select groups."sellersAuths" from ... I get relation groups does not exist.
I've tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>, however when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table. I'm not sure if my use case is what the super type is for.
I've also tried using JSON_PARSE after I cast the data to VARCHAR, or super or struct but that presents issues as well.
Thanks a ton for reading!
I apologize in advance if this is very simple and I am just missing it.
Would any of you know how to put custom attributes as column headers? I currently have a simple opt in survey on connect and I would like to have each of the 4 items as column headers and the score in the table results. I pull the data using an ODBC connection to excel so ideally I would like to just add this on the end of my current table if I can figure out how to do it.
This is how it currently looks in the output
{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}
If you have any links or anything else I can follow to improve my knowledge, that would be appreciated.
Thanks in advance
There are multiple options to query data in JSON format in Athena, and based on your use case (data source, query frequency, query destination, etc.) you can choose what makes more sense.
String Column + JSON functions
This is usually the most straightforward option and a good starting point. You define the survey_output as a string column, and when you need to extract the specific attributes from the JSON string, you can apply the JSON functions in Trino/Athena: https://trino.io/docs/current/functions/json.html. For example:
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers
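One thing to watch for: as far as I know, json_query returns JSON text, so a string value comes back with its double quotes unless you ask for them to be omitted, while json_value returns a plain scalar. The Python sketch below only illustrates that difference using the sample record from the question; it is not Athena code, and the variable names are mine.

```python
import json

# Sample record from the question.
survey_output = '{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}'

doc = json.loads(survey_output)

# JSON-encoded result: the string scalar keeps its surrounding double quotes.
as_json_text = json.dumps(doc["satisfactionscore"])

# Plain scalar result: just the characters, ready to be cast to a number.
as_scalar = doc["satisfactionscore"]
score = int(as_scalar)
```

If the quoted form causes trouble in Excel, checking the Trino docs linked above for json_value or the quote-handling options of json_query would be the place to start.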
String Column + JSON functions + View
The next way to simplify access to the data, so that users don't have to call json_query themselves, is to define a VIEW on that table and use the json_query syntax in the VIEW definition. A DBA defines the view once, and when users query the data, they see only the columns they care about. For example:
CREATE VIEW survey_results AS
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers;
With such dynamic view creation, you have more flexibility in what data will be easily exposed to the users.
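Here is a runnable sketch of the view idea with all four survey attributes exposed as column headers. It uses Python's built-in sqlite3 module as a stand-in for Athena, so json_extract here is SQLite's function rather than Trino's json_query, and the id value is made up.

```python
import sqlite3

# Sample record from the question.
SAMPLE = '{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, survey_output TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?)", ("c1", SAMPLE))

# The view is defined once; every JSON attribute becomes its own column.
conn.execute("""
    CREATE VIEW survey_results AS
    SELECT
        id,
        json_extract(survey_output, '$.effortscore') AS effortscore,
        json_extract(survey_output, '$.promoterscore') AS promoterscore,
        json_extract(survey_output, '$.satisfactionscore') AS satisfactionscore,
        json_extract(survey_output, '$.survey_opt_in') AS survey_opt_in
    FROM customers
""")

# Users query the view like an ordinary table.
cursor = conn.execute("SELECT * FROM survey_results")
headers = [column[0] for column in cursor.description]
rows = cursor.fetchall()
```

An ODBC client such as Excel would then see the view's column names as headers without knowing anything about the JSON underneath.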
Create a Table with STRUCT
Another option is to create the external table from the data source (files in S3, for example) with the STRUCT definition.
CREATE EXTERNAL TABLE survey (
id string,
survey_results struct<
effortscore:string,
promoterscore:string,
satisfactionscore:string,
survey_opt_in:string
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<FILES>'
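The STRUCT definition assumes that every line in the files has that fixed shape. A rough Python equivalent of what the table definition expects is below; the sample line and the class names are hypothetical and only meant to show the shape, not how the SerDe works internally.

```python
import json
from dataclasses import dataclass

@dataclass
class SurveyResults:
    # Mirrors the struct<...> fields in the table definition.
    effortscore: str
    promoterscore: str
    satisfactionscore: str
    survey_opt_in: str

@dataclass
class SurveyRow:
    id: str
    survey_results: SurveyResults

def parse_row(line):
    """Parse one JSON line into the fixed shape the table declares."""
    doc = json.loads(line)
    return SurveyRow(id=doc["id"], survey_results=SurveyResults(**doc["survey_results"]))

# Hypothetical line as it might appear in one of the S3 files.
line = ('{"id": "c1", "survey_results": {"effortscore": "5", "promoterscore": "5", '
        '"satisfactionscore": "5", "survey_opt_in": "True"}}')
row = parse_row(line)
```

With the table in place, the nested fields are addressed with dot notation in a query, in the same way as row.survey_results.effortscore here.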
I have a lot of json files stored in a json data type column in postgres. There are plenty of places where the key "warning" can appear. Unfortunately I cannot get a json schema, so I cannot know in advance where exactly all the warning keys can show up. So I would like to do something like this:
select report #> '{*,warning}' from foo;
Is there some way to use wildcards in paths? Or is the only way to dynamically traverse a json value, let's say key by key, recursively in a PL/pgSQL function? (If it is even possible to return a set of cursors as one big cursor.)
EDIT:
Interestingly the good old xml data type can do exactly what I need. So I am a bit puzzled why we can not do the same operations on json documents like:
select xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>');
select * from xmltable('//town' PASSING by ref '<root><oldtowns><town>Toronto</town><town>Ottawa</town></oldtowns><newtowns><town>Toronto</town><town>Ottawa</town></newtowns></root>' columns town varchar path 'text()')
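For comparison, the recursive key-by-key traversal mentioned above looks roughly like this when done outside the database. This is a client-side Python sketch with an invented sample report, not a PostgreSQL feature, and find_key is a name I made up.

```python
import json

def find_key(value, key, path=()):
    """Yield (path, value) for every occurrence of `key` at any depth."""
    if isinstance(value, dict):
        for k, v in value.items():
            if k == key:
                yield path + (k,), v
            # Keep descending, since the key may also occur further down.
            yield from find_key(v, key, path + (k,))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_key(item, key, path + (index,))

# Hypothetical report with "warning" keys at several depths.
report = json.loads(
    '{"a": {"warning": "w1", "b": [{"warning": "w2"}, {"c": 1}]}, "warning": "w0"}'
)
hits = list(find_key(report, "warning"))
```

Each hit carries the path at which the key was found, which is the information a wildcard path expression would have had to provide.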
I'm working with MongoDB, and I used a Mongo/Postgres wrapper.
Now I can find my tables and data.
I want to do some statistics, but I can't reach into the objects that have the json type in Postgres.
My problem is that I get the whole object as json, but I need to separate the fields.
I used this:
CREATE FOREIGN TABLE rents( _id NAME, status text, "from" json )
SERVER mongo_server
OPTIONS (database 'tr', collection 'rents');
The field "from" is an object.
I found something like this :
enter code here
but nothing happened
The error (why a screenshot??) means that the data are not in valid json format.
As a first step, you could define the column as type text instead of json. Then querying the foreign table will probably work, and you can see what is actually returned and why PostgreSQL thinks that this is not valid JSON.
Maybe you can create a view on top of the foreign table that converts the value to valid JSON for further processing.
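Once the column is text, you can also pull a few values out and check them on the client to see exactly where the parser gives up. A minimal Python sketch follows; the two sample values are invented, since the question doesn't show what the wrapper actually returns.

```python
import json

def check_json(text):
    """Return (True, parsed value) for valid JSON, else (False, error description)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"{exc.msg} at position {exc.pos}"

# Hypothetical values as they might come back from the text column.
good = check_json('{"city": "Paris"}')
bad = check_json("{city: 'Paris'}")  # unquoted key and single quotes are not valid JSON
```

The error description tells you what kind of rewriting the view would need to do before the value can be cast to json.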
Say I have a text field with JSON data like this:
{
"id": {
"name": "value",
"votes": 0
}
}
Is there a way to write a query which would find id and then increment the votes value?
I know I could just retrieve the JSON data, update what I need, and reinsert the updated version, but I wonder whether there is a way to do this without running two queries?
UPDATE `sometable`
SET `somefield` = JSON_REPLACE(`somefield`, '$.id.votes', JSON_EXTRACT(`somefield` , '$.id.votes')+1)
WHERE ...
Edit
As of MySQL 5.7.8, MySQL supports a native JSON data type that enables efficient access to data in JSON documents.
JSON_EXTRACT will allow you to access a particular JSON element in a JSON field, while JSON_REPLACE will allow you to update it.
To specify the JSON element you wish to access, use a string with the format
'$.[top element].[sub element].[...]'
So in your case, to access id.votes, use the string '$.id.votes'.
The SQL code above demonstrates putting all this together to increment the value of a JSON field by 1.
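If you want to try the single-statement approach without a MySQL server at hand, here is a sketch that runs the same pattern against SQLite through Python's sqlite3 module. SQLite happens to have json_replace and json_extract functions with the same names, but this is a stand-in and not MySQL itself; the pk column is hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (pk INTEGER PRIMARY KEY, somefield TEXT)")
conn.execute(
    "INSERT INTO sometable VALUES (1, ?)",
    ('{"id": {"name": "value", "votes": 0}}',),
)

# Read the current value and write back value + 1 in one UPDATE statement.
conn.execute("""
    UPDATE sometable
    SET somefield = json_replace(somefield, '$.id.votes',
                                 json_extract(somefield, '$.id.votes') + 1)
    WHERE pk = 1
""")

(updated,) = conn.execute("SELECT somefield FROM sometable WHERE pk = 1").fetchone()
doc = json.loads(updated)
```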
I think for a task like this you're stuck using a plain old SELECT followed by an UPDATE (after you parse the JSON, increment the value you want, and then serialize the JSON back).
You should wrap these operations in a single transaction, and if you're using InnoDB then you might also consider using SELECT ... FOR UPDATE : http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html
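A sketch of that read, modify, write cycle inside one transaction is below. It uses Python's sqlite3 module so that it runs anywhere; SQLite has no SELECT ... FOR UPDATE, so BEGIN IMMEDIATE stands in for the row lock here, and the pk column and increment_votes name are made up for the example.

```python
import json
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sometable (pk INTEGER PRIMARY KEY, somefield TEXT)")
conn.execute(
    "INSERT INTO sometable VALUES (1, ?)",
    (json.dumps({"id": {"name": "value", "votes": 0}}),),
)

def increment_votes(conn, pk):
    """Read the JSON, bump votes, and write it back inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        (text,) = conn.execute(
            "SELECT somefield FROM sometable WHERE pk = ?", (pk,)
        ).fetchone()
        doc = json.loads(text)       # parse
        doc["id"]["votes"] += 1      # increment
        conn.execute(
            "UPDATE sometable SET somefield = ? WHERE pk = ?",
            (json.dumps(doc), pk),   # serialize and store
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return doc["id"]["votes"]
```

On MySQL with InnoDB the shape would be the same, with the SELECT written as SELECT ... FOR UPDATE inside the transaction.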
This is sort of a tangent, but I thought I'd also mention that this is the type of operation that a NoSQL database like MongoDB is quite good at.