Defining schema for nested JSON structures in AWS Athena

I have nested json with below format:
{
  "id": "212",
  "profile": "unknown",
  "role":
  "{
    "admin_role": "yes",
    "developer_role": "yes"
  }"
}
My goal is to define the schema while creating the table, but I am getting an error for role since the data is provided as a string.
Expectation:
CREATE EXTERNAL TABLE profile (
id bigint,
profile string,
role struct<
admin_role:string,
developer_role:string
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location 's3://<bucket_name>/<path>';
Any suggestions on how to define the schema for role when first creating the table?
Thanks

There was an issue with your JSON; after correcting it as shown below, I was able to query it. Make sure you have each JSON record on a separate line.
{"id":"212","profile":"unknown","role":{"admin_role":"yes","developer_role":"yes"}}
DDL used to query JSON:
CREATE EXTERNAL TABLE `sf_73855155`(
`id` string COMMENT 'from deserializer',
`profile` string COMMENT 'from deserializer',
`role` struct<admin_role:string,developer_role:string> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='id,profile,role')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path>'
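With the corrected file in place, the nested values can be read with an ordinary query (a minimal sketch; the table and column names come from the DDL above):
SELECT id, role.admin_role, role.developer_role
FROM sf_73855155;
For the sample record this should return 212, yes, yes.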

Related

How to query JSON data in Athena with an # symbol in the key name and duplicate keys

The data I have been tasked to query is structured like this:
{
"#timestamp": "2022-11-17T21:00:19.191+00:00",
"#version": 1,
"message": "log message",
"logger_name": "com.logger.name",
"thread_name": "tomcat-thread-13",
"level": "INFO",
"level_value": 20000,
"application_name": "app_name",
"vpc": "vpc_name",
"region": "eu-west-1",
"aid": "ffffffff-ffff-ffff-ffff-ffffffffffff",
"account": "prod",
"rq": "ffffffff-ffff-ffff-ffff-ffffffffffff",
"log_shipper": "firehose",
"application_name": "app_name",
"account": "prod",
"region": "eu-west-1"
}
As you can see there are some duplicate keys in here, so both the Hive and OpenX JSON SerDe throw an error and won't query it at all.
I've created a table using the Ion SerDe, which can read the data, but the #timestamp and #version fields are always blank, all the other fields are read correctly.
The initial table definition I had was this...
CREATE EXTERNAL TABLE firehose_logs_pe (
`#timestamp` STRING,
`#version` STRING,
<other columns>
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
I also tried to rename the fields and use a path extractor to get the values, like this...
CREATE EXTERNAL TABLE firehose_logs_pe (
ts STRING,
version STRING,
<other columns>
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
'ion.ts.path_extractor' = '(`#timestamp`)',
'ion.version.path_extractor' = '(`#version`)'
)
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
However, the values of the ts and version fields are still empty. The query also seems to run slower using the path extractors.
Is there any way to query this data in this format with Athena? As a test I did a find and replace on one of the JSON files and removed the #, at which point everything worked as it should. However, this is not a practical solution when I have about 20 TB of data to query in hundreds of millions of files.

AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table I get an error.
How can I map a JSON file source to a JSONB target column?
It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
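Once converted, the column behaves as real JSONB. A quick sanity check (a sketch; the key firstName is only an assumed example of what the stored documents might contain):
select column_with_json->>'firstName'
from postgres_table;
The ->> operator extracts a JSONB key as text, so this query only works once the cast has succeeded.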
The crawler will identify the JSONB column type as "string", but you can try to use the Unbox class in Glue to convert this column to JSON.
Let's check the following table in PostgreSQL:
create table persons (id integer, person_data jsonb, creation_date timestamp)
Here is an example of one record from the persons table:
ID = 1
PERSON_DATA = {
"firstName": "Sergii",
"age": 99,
"email":"Test#test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code needs to be added in the Glue job (the imports and context setup are included for completeness):
from awsglue.transforms import Unbox
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# 1. create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons")
# 2. in path, pass the name of the JSONB column that needs to be converted to JSON
df_persons_json = Unbox.apply(frame = df_persons, path = "person_data", format = "json")
# 3. convert the dynamic frame to a data frame
datf_persons_json = df_persons_json.toDF()
# 4. after that you can process this column as a JSON datatype, or create a dataframe with all necessary columns; each JSON data element can be added as a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html

Hive JSON SerDe - keys with white spaces not populating

I have an external table that is built off of a json file.
All of the json keys are columns and are populated as expected except for one key that has a space.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( event ARRAY<STRUCT<
value:STRING,
info:STRUCT<
id:STRING,
event_source:STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.event_source" = "event source")
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'foo/bar'
All of the values show up as expected except for event_source, which shows up as NULL. The original form of event_source in the json file is 'event source' without the single quotes.
Is there something I need to do different with the WITH SERDEPROPERTIES setting in order to get the key to work properly?
Thanks
You mean that the JSON has data like
{ id: "myid", event source: "eventsource" }
If so, there's not much that can be done, since it's simply broken JSON.
If not, can you post a sample of the JSON you're trying to read?
I have encountered a similar problem to the one above, but with a slight variation: the input data is correct JSON.
I have an external table that is built off of a json file. All of the JSON keys are populated except one, msrp_currency.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( id string,
variants array<struct<pid:string, msrp_currency:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true" ,
'mapping.variants.msrp_currency' = 'variants.msrpcurrency')
LOCATION 'foo/bar'
All of the values show up as expected except for msrp_currency, which shows up as NULL. The reason I need to introduce the underscore is that later I need to extract the same field value as msrpCurrency using the Brickhouse to_json UDF.
sample values:
{ "pid": "mypid", "msrpCurrency": "USD" }

Issue with Hive SerDe dealing with nested structs

I am trying to load a huge volume of JSON data with a nested structure into Hive using a JSON SerDe. Some of the field names start with $ in the nested structure. I am mapping Hive field names using SerDe properties, but when I query the table I get null in the field starting with $. I have tried different syntaxes, but no luck.
Sample JSON:
{
  "_id" : "319FFE15FF90",
  "SomeThing" :
  {
    "$SomeField" : 22,
    "AnotherField" : 2112,
    "YetAnotherField": 1
  }
  . . . etc . . . .
Using a schema as follows:
create table testSample
(
`_id` string,
something struct
<
$somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
This schema builds OK; however, somefield (the one starting with $) in the above table always returns null (all the other values exist and are correct).
We've been trying a lot of syntax combinations, but to no avail.
Does anyone know the trick to map a nested field with a leading $ in its name?
You almost got it right. Try creating the table like this.
The mistake you're making is in the SerDe properties mapping (mapping.somefield = "$somefield"): you're saying "when looking for the hive column named 'somefield', look for the JSON field '$somefield'". But in your schema you defined the column with the dollar sign, which, if not outright illegal, is for sure not best practice in Hive.
create table testSample
(
`_id` string,
something struct
<
somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
I tested it with some test data:
{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
I am suddenly starting to see this problem as well, but for normal column names too (no special characters such as $).
I am populating an external table (Temp) from another internal table (Table2) and want the output of the Temp table in JSON format. I want column names in camel case in the output JSON file, so I am also using the SerDe properties in the Temp table to specify the correct names. However, I am seeing that when I do Select * from the Temp table, it gives NULL values for the columns whose names have been used in the mapping.
I am running Hive 0.13. Here are the commands:
Create table command:
CREATE EXTERNAL TABLE Temp (
data STRUCT<
customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customerid' = 'customerId',
'mapping.marketplaceid' = 'marketplaceId'
)
LOCATION '/output';
INSERT INTO TABLE Temp
SELECT
named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin)
FROM Table2;
Select * from Temp:
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}
See how "customerid" and "marketplaceid" are null. Generated JSON file is:
{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}
Now, if I remove the WITH SERDEPROPERTIES clause, the table starts getting all values:
{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}
And then the JSON file so generated is:
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}

Load complex JSON in Hive using JsonSerDe

I am trying to build a table in Hive for the following JSON:
{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "hours": {
    "Tuesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Friday": {
      "close": "17:00",
      "open": "08:00"
    }
  },
  "open": true,
  "categories": [
    "Doctors",
    "Health & Medical"
  ],
  "review_count": 9,
  "name": "Eric Goldberg, MD",
  "neighborhoods": [],
  "attributes": {
    "By Appointment Only": true,
    "Accepts Credit Cards": true,
    "Good For Groups": 1
  },
  "type": "business"
}
I can create a table using the following DDL; however, I get an exception while querying that table.
CREATE TABLE IF NOT EXISTS business (
business_id string,
hours map<string,string>,
open boolean,
categories array<string>,
review_count int,
name string,
neighborhoods array<string>,
attributes map<string,string>,
type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
The exception while retrieving data is "ClassCastException: can't cast JSONArray to JSONObject". What is the correct schema for this JSON? Is there any tool which can help me generate the correct schema for a given JSON to be used with the JSON SerDe?
It looks to me that the problem is hours, which you defined as hours map<string,string> but should be a map<string,map<string,string>> instead.
There's a tool you can use to generate the hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema
but you may want to adjust the output, because when encountering a JSON object (anything between {}) the tool can't know whether to translate it to a Hive map or to a struct.
On your data, the tool gives me this:
CREATE TABLE x (
attributes struct<accepts credit cards:boolean,
by appointment only:boolean, good for groups:int>,
business_id string,
categories array<string>,
hours map<string,struct<close:string, open:string>>,
name string,
neighborhoods array<string>,
open boolean,
review_count int,
type string
)
but it looks like you want something like this:
CREATE TABLE x (
attributes map<string,string>,
business_id string,
categories array<string>,
hours map<string,struct<close:string, open:string>>,
name string,
neighborhoods array<string>,
open boolean,
review_count int,
type string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
hive> load data local inpath 'json.data' overwrite into table x;
Table default.x stats: [numFiles=1, numRows=0, totalSize=416, rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
"good for groups":"1"}
vcNAWiLM4dR7D2nwwJ7nCA
["Doctors","Health & Medical"]
{"tuesday":{"close":"17:00","open":"08:00"},
"friday":{"close":"17:00","open":"08:00"}}
Eric Goldberg, MD ["HELLO"] true 9 business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
A few notes though:
Notice I used a different JSON SerDe because I don't have the one you used on my system. I used this one; I like it better because, well, I wrote it. But the create statement should work just as well with the other SerDe.
You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names with a space in them, like accepts credit cards. My SerDe allows mapping a JSON attribute to a different Hive column name. That is also needed when the JSON uses an attribute that is a Hive keyword, like 'timestamp' or 'create'.
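As a minimal sketch of that keyword mapping, following the pattern the OpenX SerDe documents (the table and column names here are illustrative):
CREATE TABLE logs (ts string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.ts" = "timestamp");
This reads the JSON attribute timestamp into the Hive column ts, keeping the reserved word out of the DDL.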