Querying custom attributes in Athena - JSON

I apologize in advance if this is very simple and I am just missing it.
Would any of you know how to put custom attributes as column headers? I currently have a simple opt-in survey on Connect and I would like to have each of the 4 items as column headers and the score in the table results. I pull the data using an ODBC connection to Excel, so ideally I would like to just add this on the end of my current table if I can figure out how to do it.
This is how it currently looks in the output
{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}
If you have any links or resources I can follow to improve my knowledge, that would be great.
Thanks in advance

There are multiple options to query data in JSON format in Athena, and based on your use case (data source, query frequency, query destination, etc.) you can choose what makes more sense.
String Column + JSON functions
This is usually the most straightforward option and a good starting point. You define the survey_output as a string column, and when you need to extract the specific attributes from the JSON string, you can apply the JSON functions in Trino/Athena: https://trino.io/docs/current/functions/json.html. For example:
SELECT
  id,
  json_query(
    survey_output,
    'lax $.satisfactionscore'
  ) AS satisfactionscore
FROM customers
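Note that json_query returns the matched fragment as JSON text, so string values typically keep their surrounding quotes. If you want the bare value, for example to cast a score to a number, json_extract_scalar is a common alternative; a minimal sketch against the same customers table:
SELECT
  id,
  -- json_extract_scalar returns the scalar as varchar, so cast it to a number
  CAST(json_extract_scalar(survey_output, '$.satisfactionscore') AS integer) AS satisfactionscore
FROM customers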
String Column + JSON functions + View
Another way to simplify access to the data, without requiring users to write json_query themselves, is to define a VIEW on that table using the json_query syntax in the VIEW creation. A DBA defines the view once, and when users query the data they see only the columns they care about. For example:
CREATE VIEW survey_results AS
SELECT
  id,
  json_query(
    survey_output,
    'lax $.satisfactionscore'
  ) AS satisfactionscore
FROM customers;
With such view definitions, you have more flexibility over which data is exposed to the users.
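Once the view exists, users can query it like any regular table, for example:
-- Query the view defined above as if it were a plain table
SELECT id, satisfactionscore
FROM survey_results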
Create a Table with STRUCT
Another option is to create the external table from the data source (files in S3, for example) with the STRUCT definition.
CREATE EXTERNAL TABLE survey (
  id string,
  survey_results struct<
    effortscore:string,
    promoterscore:string,
    satisfactionscore:string,
    survey_opt_in:string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<FILES>'
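With the STRUCT definition in place, the nested attributes can then be queried with dot notation; a minimal sketch against the hypothetical survey table above:
-- Each struct field becomes addressable as <column>.<field>
SELECT
  id,
  survey_results.satisfactionscore,
  survey_results.survey_opt_in
FROM survey
WHERE survey_results.survey_opt_in = 'True'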

Related

Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from json data with input format org.apache.hadoop.mapred.TextInputFormat and output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the JSON is a highly nested JSON called groups. The nested data doesn't follow a strict schema, so not all JSON objects within groups have the same attributes. I'm having trouble accessing the attributes of groups and I suspect that I am not casting groups to the proper datatype.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried using the super type, and when I select groups I get the entire JSON, but when I select an attribute of groups such as select groups.sellersAuths from ... or select groups."sellersAuths" from ... I get relation groups does not exist.
I've tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>, however when accessing something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table. I'm not sure if my use case is what the super type is for.
I've also tried using JSON_PARSE after I cast the data to VARCHAR, or super or struct but that presents issues as well.
Thanks a ton for reading!
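Since the keys under groups vary between records, a map type (rather than a struct with literal key and value fields) is the usual way to model data like this. The following is a rough, untested sketch of the idea only; the table, schema and bucket names are placeholders and the exact Redshift Spectrum syntax should be checked against the documentation:
CREATE EXTERNAL TABLE spectrum.authorizations (
  entity varchar(20),
  dataset varchar(32),
  aggregations map<varchar(32), int>,
  -- dynamic keys at each level are modelled as nested maps, not structs
  groups map<varchar(32), map<varchar(32), map<varchar(32), int>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<PREFIX>/';

-- Spectrum documents maps as behaving like arrays of key/value structs,
-- so the outer level would be reached by unnesting, roughly:
SELECT a.entity, g.key AS group_name
FROM spectrum.authorizations a, a.groups g;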

How to store dynamically generated JSON object in Big Query Table?

I have a use case to store dynamic JSON objects in a column in Big Query. The schema of the object is dynamically generated by the source and not known beforehand. The number of key value pairs in the object can differ as well, as shown below.
Example JSON objects:
{"Fruit":"Apple","Price":"10","Sale":"No"}
{"Movie":"Avatar","Genre":"Fiction"}
I could achieve the same in Hive by defining the column as a map<string, string> object, and I could query the data in the column like col_name["Fruit"] or col_name["Movie"] for the corresponding row.
Is there an equivalent way of doing the above in Big Query? I came across the 'RECORD' data type, but the schema needs to be the same for all the objects in the column.
Note: Storing the column as string datatype is not an option as the users need to query the data on the keys directly without parsing after retrieving the data.
Storing the data as a JSON string seems to be the only way to implement your requirement, at the moment. As a workaround, you can create a JavaScript UDF that parses the JSON string and extracts the necessary information. Below is a sample UDF.
CREATE TEMP FUNCTION extract_from_json(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
  const obj = JSON.parse(json);
  return obj[key];
""";

WITH json_table AS (
  SELECT '{"Fruit":"Apple","Price":"10","Sale":"No"}' AS json_data UNION ALL
  SELECT '{"Movie":"Avatar","Genre":"Fiction"}' AS json_data
)
SELECT extract_from_json(json_data, 'Movie') AS movie
FROM json_table
You can also check out the newly introduced JSON data type in BigQuery. The data type offers more flexibility when handling JSON data but please note that the data type is still in preview and is not recommended for production. You will have to enroll in this preview. For more information on working with JSON data, refer to this documentation.
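For completeness, a minimal sketch of what querying the JSON data type could look like once enrolled in the preview (the dataset and table names below are hypothetical):
-- JSON_VALUE extracts a scalar as a STRING, so keys can be queried directly
SELECT JSON_VALUE(json_data, '$.Fruit') AS fruit
FROM my_dataset.json_table
WHERE JSON_VALUE(json_data, '$.Sale') = 'No';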

partitioned json queries in athena return no data

I have been trying to set up an Athena database for some time. I seem to have the database set up correctly, but when I query it, it returns no data. The data I am querying is a series of partitioned S3 files in the structure of
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file.json"
there can be multiple file.json in one partition e.g.
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file1.json"
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file2.json"
Below are the queries I am running along with the create command and the data stored
CREATE EXTERNAL TABLE bench_logs (
  id string,
  filename string,
  data struct<
    transmit_start: timestamp,
    transmit_end: timestamp,
    transfer_start: timestamp,
    transfer_end: timestamp,
    processing_start: timestamp,
    processing_end: timestamp
  >
)
PARTITIONED BY (
  partition_1 string,
  partition_2 string,
  partition_3 date,
  partition_4 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://benchmark-files/complete/'
TBLPROPERTIES (
  'classification'='json',
  'storage.location.template'='s3://iceqube-benchmark-files/complete/partition_1=${partition_1}/partition_2=${partition_2}/partition_3=${partition_3}/partition_4=${partition_4}/'
)
that table is being queried like:
SELECT id FROM "benchmark"."bench_logs"
WHERE partition_1='foo'
AND partition_2='bar'
AND partition_3=cast('1970-01-01' as date)
AND partition_4='09:30:00';
The query says it ran correctly, but I see no data other than column headers.
If any more data is needed I'll provide it; I have been stuck for a few days now and can't get my head around it at all. Thanks in advance.
Before you can query a partitioned table you must add the partitions to it. This can be done with ALTER TABLE bench_logs ADD PARTITION …, or by using partition projection, as well as other ways.
Also, you seem to have mixed up the keys and values of your Hive partition scheme: if a partition key is called partition_1 the S3 URI should be …/partition_1=data_1/…, not …/data_1=partition_1/….
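For illustration, registering a single partition manually could look something like this; the values and location below are placeholders, laid out in the key=value form described above:
-- Add one partition explicitly (repeat or script this for every partition)
ALTER TABLE bench_logs ADD IF NOT EXISTS
PARTITION (partition_1 = 'foo', partition_2 = 'bar', partition_3 = '1970-01-01', partition_4 = '09-30-00')
LOCATION 's3://benchmark-files/complete/partition_1=foo/partition_2=bar/partition_3=1970-01-01/partition_4=09-30-00/';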
Coming back in case anyone else has these issues.
I followed Theo's advice and still had problems querying.
It turns out the problem was having a partition value containing ":", as this is normally a restricted character in S3, but as I was writing programmatically it was allowed through.
A full explanation of this is answered better here.

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the JSON (content).
It is probably something related to how to perform the query against an array field in Spectrum.
Any idea?
You have to give the table an alias to extract the fields from the external schema:
SELECT
  a.content.my_boolean,
  a.content.my_string,
  a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data; I really don't know why it needs this alias, but it works. If you need to access elements of an array you have to explode it, like:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members AS member;
Reference
You need to create a Glue Classifier. Select JSON as the Classifier type, and for the JSON Path input the following:
$[*]
Then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for, but figured I'd drop this here just in case others had the same problem I had.

Query spark on JSON object stored on Cassandra DB

I built a structure on Cassandra DB to store time series OS data such as services, processes and other information. To understand how Cassandra works for storing JSON data and retrieving it with conditional CQL queries, I preferred to simplify the model, because in the full data model I'll have TYPEs more complex than report_object, like a hashMap of an array of hashMaps, for example:
Type NETSTAT--> Object[n] --> {host:192.168.0.23, protocol: TCP ,LocalAddress : 0.0.0.0}
so the type NETSTAT will have a list of hashMaps that contain key -> value fields.
To simplify, I have chosen to show the following schema:
CREATE TYPE report_object (RTIME varchar, RMINORVER int, RUSER varchar, RLANG varchar, RSCRIPT varchar, RMAJORVER int, RHOST varchar, RPATH varchar);
CREATE TABLE test (
REPORTUUID uuid PRIMARY KEY,
report frozen<report_object>);
Inside the table I inserted the JSON data with the following query from a Java class:
INSERT INTO test JSON '{"REPORTUUID": "9fb21fb9-333e-4017-ab77-0fa6ee1e20e3" ,"REPORT":{"RTIME":"6/MAR/2016 6:0:0 PM","RMINORVER":0,"RUSER":"Administrator","RLANG":"vbs","RSCRIPT":"Main","RMAJORVER":5,"RHOST":"WIN-SAPV9MUEMNS","RPATH":"C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\IXP000.TMP"}}';
I inserted other data with queries like the one above.
The questions to clarify my concepts are:
- I would like to run queries with conditions that check inside the defined TYPE; is this possible with CQL, or is it necessary to use Spark SQL?
- Is this DB model design right for the purpose? (I have moved from an RDBMS to a NoSQL DB.)
To be able to query a User Defined Type using Cassandra, you'll have to create an index first:
CREATE INDEX on test.test(report);
but it allows only a predicate based on a full document:
SELECT * FROM test
WHERE report=fromJson('{"RTIME":"6/MAR/2016 6:0:0 PM","RMINORVER":0,"RUSER":"Administrator","RLANG":"vbs","RSCRIPT":"Main","RMAJORVER":5,"RHOST":"WIN-SAPV9MUEMNS","RPATH":"C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\IXP000.TMP"}');
You'll find more details and explanation in how to filter cassandra query by a field in user defined type
When exposed using Spark these values can be filtered using filter on CassandraTableScanRDD:
val rdd = sc.cassandraTable("test", "test")
rdd.filter(row =>
row.getUDTValue("report").getString("rscript") == "Main")
or where / filter on a DataFrame:
df.where($"report.rscript" === "Main")
Note that with a query like this using Spark, the whole table has to be fetched before the data can be filtered. It is not clear what exactly you are trying to achieve, but it is rather unlikely that this will be a useful structure in general.