Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from json data with input format org.apache.hadoop.mapred.TextInputFormat and output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the json is a highly nested json called groups. The nested data doesn't follow a strict schema, so not every json object within groups has the same attributes. I'm having trouble accessing the attributes of groups, and I suspect that I am not casting groups to the proper datatype.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried using super type and when I select for groups I get the entire json, but when I select for an attribute of groups such as select groups.sellersAuths from ... or select groups."sellersAuths" from ... I get relation groups does not exist.
I've tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table, and I'm not sure if my use case is what the super type is for.
I've also tried using JSON_PARSE after I cast the data to VARCHAR, or super or struct but that presents issues as well.
Thanks a ton for reading!
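To make the shape of groups concrete, here is a minimal sketch that walks one of the sample records outside Redshift. It is plain Python, not Spectrum SQL, and the function name flatten_groups and the output column order are made up for illustration. It shows that the names at every level (sellersAuths, mws_region, USAmazon, ...) are themselves data, and that no attribute is literally called key or value, which would be consistent with the struct<key:..., value:...> definition returning NULL.

```python
import json

# One record from the sample above, trimmed to "entity" and "groups".
record = json.loads(
    '{"entity":"1111111","groups":{"sellersAuths":{"mws_region":{"USAmazon":1},'
    '"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},'
    '"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},'
    '"last_updated_by":{"JPAmazon":0}}}}'
)


def flatten_groups(rec):
    """Yield one (entity, metric, dimension, name, count) tuple per leaf value."""
    for metric, dimensions in rec["groups"].items():
        for dimension, counts in dimensions.items():
            for name, count in counts.items():
                yield (rec["entity"], metric, dimension, name, count)


rows = list(flatten_groups(record))
for row in rows:
    print(row)
```

Each leaf becomes one row, so the varying attribute names end up as column values instead of having to be declared in the table schema.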

Related

Querying custom attributes in Athena

I apologize in advance if this is very simple and I am just missing it.
Would any of you know how to put custom attributes as column headers? I currently have a simple opt in survey on connect and I would like to have each of the 4 items as column headers and the score in the table results. I pull the data using an ODBC connection to excel so ideally I would like to just add this on the end of my current table if I can figure out how to do it.
This is how it currently looks in the output
{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}
If you have any links or anything else I can follow to improve my knowledge, that would be appreciated.
Thanks in advance
There are multiple options to query data in JSON format in Athena, and based on your use case (data source, query frequency, query destination, etc.) you can choose what makes more sense.
String Column + JSON functions
This is usually the most straightforward option and a good starting point. You define the survey_output as a string column, and when you need to extract the specific attributes from the JSON string, you can apply the JSON functions in Trino/Athena: https://trino.io/docs/current/functions/json.html. For example:
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers
String Column + JSON functions + View
The next way to simplify access to the data, without writing json_query in every query, is to define a VIEW on that table that uses the json_query syntax in its definition. A DBA defines the view once, and when users query it, they see only the columns they care about. For example:
CREATE VIEW survey_results AS
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers;
With such dynamic view creation, you have more flexibility in what data will be easily exposed to the users.
Create a Table with STRUCT
Another option is to create the external table from the data source (files in S3, for example) with the STRUCT definition.
CREATE EXTERNAL TABLE survey (
id string,
survey_results struct<
effortscore:string,
promoterscore:string,
satisfactionscore:string,
survey_opt_in:string
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<FILES>'
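To illustrate what all three options are doing per row, here is a small Python sketch (not Athena SQL) that turns the JSON string from the question into one column per attribute. The id values and the second, incomplete row are made up for illustration; only the first survey_output value comes from the question.

```python
import json

# First survey_output is the sample from the question; ids and the second row are hypothetical.
rows = [
    {"id": "c-1", "survey_output": '{"effortscore":"5","promoterscore":"5",'
                                   '"satisfactionscore":"5","survey_opt_in":"True"}'},
    {"id": "c-2", "survey_output": '{"effortscore":"3","survey_opt_in":"True"}'},
]

COLUMNS = ["effortscore", "promoterscore", "satisfactionscore", "survey_opt_in"]


def expand(row):
    """Parse the JSON string and return one flat dict with a key per column."""
    attrs = json.loads(row["survey_output"])
    # dict.get returns None for a missing attribute, similar in spirit to a NULL column.
    return {"id": row["id"], **{c: attrs.get(c) for c in COLUMNS}}


table = [expand(r) for r in rows]
for line in table:
    print(line)
```

The four attribute names become the column headers, and a row that lacks an attribute simply gets an empty value for that column.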

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the json (content).
It is probably related to how the query has to be written against an array field in Spectrum.
Any ideas?
You have to give the table an alias to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue on my data. I really don't know why it needs this alias, but it works. If you need to access the elements of an array, you have to explode it like this:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members as member;
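As a rough picture of what exploding the array means, here is a Python sketch (not Spectrum SQL). The rows, the field values, and the name explode_members are all made up for illustration; only the content/members layout follows the question.

```python
# Hypothetical rows shaped like the table in the question: a "content" struct
# holding scalar fields plus a "members" array. All values are made up.
table_a = [
    {"content": {"my_string": "x", "members": [{"name": "ann"}, {"name": "bob"}]}},
    {"content": {"my_string": "y", "members": []}},
]


def explode_members(rows):
    """Yield one output row per array element, paired with its parent row."""
    for a in rows:
        for member in a["content"]["members"]:
            yield {"my_string": a["content"]["my_string"], "member_name": member["name"]}


result = list(explode_members(table_a))
for row in result:
    print(row)
```

In this sketch the first parent row produces two output rows, one per member, and the parent with an empty array produces none.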
You need to create a Glue Classifier.
Select JSON as Classifier type
and for the JSON Path input the following:
$[*]
then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for but figured I'd drop this here just in case others had the same problem I had.

How do I convert a column of JSON strings into a parquet table

I am trying to convert some data that I am receiving into a parquet table that I can eventually use for reporting, but feel like I am missing a step.
I receive files that are CSVs where the format is "id", "event", "source" where the "event" column is a GZIP compressed JSON string. I've been able to get a dataframe set up that extracts the three columns, including getting the JSON string unzipped. So I have a table now that has
id | event | source | unencoded_event
Where the unencoded_event is the JSON string.
What I'd like to do at this point is to take that one string column of JSON and parse it out into individual columns. Based on a comment from another developer (that the process of converting to parquet is smart enough to just use the first row of my results to figure out schema), I've tried this:
df1 = spark.read.json(df.select("unencoded_event").rdd).write.format("parquet").saveAsTable("test")
But this just gives me a single column table with a column of _corrupt_record that just has the JSON string again.
What I'm trying to get to is to take schema:
{
"agent"
--"name"
--"organization"
"entity"
--"name"
----"type"
----"value"
}
And get the table to, ultimately, look like:
AgentName | Organization | EventType | EventValue
Is the step I'm missing just explicitly defining the schema or have I oversimplified my approach?
Potential complications here: the JSON schema is actually more involved than above; I've been assuming I can expand out the full schema into a wider table and then just return the smaller set I care about.
I have also tried taking a single result from the file (so, a single JSON string), saving it as a JSON file and trying to read from it. Doing so works, i.e., doing the spark.read.json(myJSON.json) parses the string into the arrays I was expecting. This is also true if I copy multiple strings.
This doesn't work if I take my original results and try to save them. If I try to save just the column of strings as a json file
dfWrite = df.select(col("unencoded_event"))
dfWrite.write.mode("overwrite").json(write_location)
then read them back out, this doesn't behave the same way...each row is still treated as strings.
I did find one solution that works. This is not a perfect solution (I'm worried that it's not scalable), but it gets me to where I need to be.
I can select the data using get_json_object() for each column I want (sorry, I've been fiddling with column names and the like over the course of the day):
dfResults = df.select(get_json_object("unencoded_event", "$.agent[0].name").alias("userID"),
get_json_object("unencoded_event", "$.entity[0].identifier.value").alias("itemID"),
get_json_object("unencoded_event", "$.entity[0].detail[1].value").alias("itemInfo"),
get_json_object("unencoded_event", "$.recorded").alias("timeStamp"))
The big thing I don't love about this is that it appears I can't use filter/search options with get_json_object(). That's fine for the foreseeable future, because right now I know where all the data should be and don't need to filter.
I believe I can also use from_json() but that requires defining the schema within the notebook. This isn't a great option because I only need a small part of the JSON, so it feels like unnecessary effort to define the entire schema. (I also don't have control over what the overall schema would be, so this becomes a maintenance issue.)
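For anyone who wants to see the path-based extraction idea in isolation, here is a small stdlib-only Python sketch. It is not Spark code: get_path is a hypothetical helper, and the event string is made up (the real schema is larger). It pulls only the fields of interest out of a JSON string and returns None when a path does not exist, which is similar in spirit to picking individual values with get_json_object().

```python
import json

# A made-up event string; the real events in the question have a larger schema.
event = json.dumps({
    "agent": [{"name": "user-1"}],
    "entity": [{"identifier": {"value": "item-9"}}],
    "recorded": "2020-01-01T00:00:00Z",
})


def get_path(json_str, *steps):
    """Follow dict keys and list indexes in order; return None if a step is missing."""
    node = json.loads(json_str)
    for step in steps:
        if isinstance(node, dict) and step in node:
            node = node[step]
        elif isinstance(node, list) and isinstance(step, int) and 0 <= step < len(node):
            node = node[step]
        else:
            return None
    return node


row = {
    "userID": get_path(event, "agent", 0, "name"),
    "itemID": get_path(event, "entity", 0, "identifier", "value"),
    "itemInfo": get_path(event, "entity", 0, "detail", 1, "value"),  # absent here
    "timeStamp": get_path(event, "recorded"),
}
print(row)
```

Only the paths that are asked for need to be known; the rest of the document can change shape without breaking the extraction.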

How to separate a json field in Postgres and get its fields

I'm working with MongoDB, and I used a Mongo/Postgres wrapper.
Now I can find my tables and data.
I want to do some statistics, but I can't reach the objects that have the json type in Postgres.
My problem is that I get the whole object as json, but I need to separate the fields.
I used this:
CREATE FOREIGN TABLE rents( _id NAME, status text, "from" json )
SERVER mongo_server
OPTIONS (database 'tr', collection 'rents');
The field "from" is an object.
I tried a query I found for this, but nothing happened.
The error (why a screenshot??) means that the data are not in valid json format.
As a first step, you could define the column as type text instead of json. Then querying the foreign table will probably work, and you can see what is actually returned and why PostgreSQL thinks that this is not valid JSON.
Maybe you can create a view on top of the foreign table that converts the value to valid JSON for further processing.

Accessing child fields of a nested json data using sparksql

I'm doing an exploratory data analysis with hadoop job history files log data.
Below given is the sample data used for the analysis
{"type":"AM_STARTED","event":{"org.apache.hadoop.mapreduce.jobhistory.AMStarted":{"applicationAttemptId":"appattempt_1450790831122_0001_000001","startTime":1450791753482,"containerId":"container_1450790831122_0001_01_000001","nodeManagerHost":"centos65","nodeManagerPort":52981,"nodeManagerHttpPort":8042}}}
I just need to select the child values, like applicationAttemptId, startTime and containerId, of the event
org.apache.hadoop.mapreduce.jobhistory.AMStarted
I tried the simple select query below:
val out=sqlcontext.sql("select event.org.apache.hadoop.mapreduce.jobhistory.AMStarted.applicationAttemptId from sample")
but it throws the below error
org.apache.spark.sql.analysisException: no such struct field org in org.apache.hadoop.mapreduce.jobhistory.AMStarted.applicationAttemptId
Unfortunately, the field name in the data looks like this: "org.apache.hadoop.mapreduce.jobhistory.AMStarted"
I manipulated the data myself so that the name became org_apache_hadoop_mapreduce_jobhistory_AMStarted, and tried the same query like this:
val out=sqlcontext.sql("select event.org_apache_hadoop_mapreduce_jobhistory_AMStarted.applicationAttemptId from sample")
Now I'm able to access the child fields of AMStarted, but it's not the right way to do it.
Is there any way to do this without manipulating the data?
After spending some quality time searching for a solution, I found that quoting the field name with backticks did the trick:
`org.apache.hadoop.mapreduce.jobhistory.AMStarted`
And then the query works like a charm,
val out=sqlcontext.sql("select event.`org.apache.hadoop.mapreduce.jobhistory.AMStarted`.applicationAttemptId from sample")
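To show why the quoting matters, here is a minimal Python sketch (stdlib only, not Spark) that parses the sample line, trimmed to three child fields. The variable names are made up for illustration. The dotted name is one single key under event, so treating each dot as a nesting level looks for a child called org that does not exist.

```python
import json

# The sample line from the question, trimmed to three child fields.
line = (
    '{"type":"AM_STARTED","event":{"org.apache.hadoop.mapreduce.jobhistory.AMStarted":'
    '{"applicationAttemptId":"appattempt_1450790831122_0001_000001",'
    '"startTime":1450791753482,'
    '"containerId":"container_1450790831122_0001_01_000001"}}}'
)
record = json.loads(line)

# The whole dotted string is ONE key, not a chain of nested objects.
FIELD = "org.apache.hadoop.mapreduce.jobhistory.AMStarted"
event_keys = list(record["event"].keys())
started = record["event"][FIELD]

print(event_keys)
print(started["applicationAttemptId"], started["startTime"], started["containerId"])
```

Backticks in the SQL play the same role as using the full string as a single dictionary key here.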