Athena DDL statement for different data structures - JSON

I have data in XML form which I have converted to JSON format through a Glue crawler. The problem is in writing the DDL statement for a table in Athena: as you can see below, there is a Contact attribute in the JSON data. In some places it is a struct (single instance) and in others it is an array (multiple instances). I am sharing the DDL statements for each type below as well.
JSON Data Type 1
"ContactList": {
"Contact": {
}
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  ContactList struct<
    Contact: struct<
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
JSON Data Type 2
"ContactList": {
"Contact": [
{},
{}
]
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  ContactList struct<
    Contact: array<
      struct<
      >
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
I am able to write a DDL statement for one case at a time only, and it works perfectly for each individual type. My question is how to write a DDL statement that can cater to both types, whether Contact is a struct or an array. Thanks in advance.

The way you solve this in Athena is to use the string type for the Contact field of the ContactList column, and then use JSON functions in your queries.
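For reference, a minimal sketch of what that table definition could look like, reusing the placeholder table name and S3 path from the question (an illustration of the idea, not the exact DDL from the original answer):
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  -- declaring Contact as string keeps both the single-object and the array
  -- form as raw JSON text, to be parsed with JSON functions at query time
  ContactList struct<
    Contact: string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')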
When you query, you can for example do this (assuming contacts have a "name" field):
SELECT
  COALESCE(
    json_extract_scalar(ContactList.Contact, '$.name'),
    json_extract_scalar(ContactList.Contact, '$[0].name')
  ) AS name
FROM table_name
This uses json_extract_scalar, which parses a string as JSON and then extracts a value using a JSONPath expression. COALESCE picks the first non-null value, so if the first JSONPath expression does not yield a value (because Contact is an array rather than a single object), the second is attempted and returns the name of the first contact in the array.

Related

Error in data while creating external tables in Athena

I have my data in CSV format in the below form:
Id -> tinyint
Name -> String
Id Name
1 Alex
2 Sam
When I export the CSV file to S3 and create an Athena table, the data transforms into the following format.
Id Name
1 "Alex"
2 "Sam"
How do I get rid of the double quotes while creating the table?
Any help is appreciated.
By default, if no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads the quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerde (specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable(
id tinyint,
Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/mytable/'
;
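Once the table is created with OpenCSVSerde, a quick sanity check (a hedged example using the table and columns defined above):
-- values should now come back without the surrounding double quotes
SELECT id, Name
FROM mytable
LIMIT 10;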
Read the manuals: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe

Handling corrupt JSON structure with Athena AWS ( HIVE_BAD_DATA)

I need to access a JSON structure from the table "data_test":
id (string)
att (struct<field1:string,field2:string,field3:int>)
SELECT
id,
att.field1,
att.field2,
att.field3
FROM database.data_test as rawdata
I receive the following error:
HIVE_BAD_DATA: Error parsing field value for field 1: For input
string: "2147483648"
So, as I understand it, there is a numeric value 2147483648 in a string field, which causes the corrupt data.
Then I tried to CAST the string fields as varchar, but the result was the same.
SELECT
id,
CAST(att.field1 as VARCHAR) as field1,
CAST(att.field2 as VARCHAR) as field2,
att.field3
FROM database.data_test as rawdata
HIVE_BAD_DATA: Error parsing field value for field 1: For input
string: "2147483648"
When I just select the id, then everything works fine.
SELECT
id
FROM database.data_test as rawdata
Unfortunately, I do not even know the IDs of the corrupt data, otherwise I would just skip them with a WHERE clause. I only have access to the data through Athena, so it is hard for me to get more information.
I asked the AWS admin to add the ignore.malformed.json option, so that the JSON SerDe does not let corrupt data through. He told me that he cannot do it, because then too much data would be skipped.
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
The admin gave me the DDL:
CREATE EXTERNAL TABLE ${dbName}.${tableName}(
`id` string,
`att` struct<field1:string,field2:string,field3:int>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('paths'='att')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://${outPutS3BucketLocation}'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='create_benchmark_athena_table',
'averageRecordSize'='87361',
'classification'='json',
'compressionType'='gzip',
'objectCount'='1',
'recordCount'='100',
'sizeKey'='315084',
'typeOfData'='file')
I have three questions:
Is there a way to SELECT even corrupt data e.g. all fields as a string?
Can I just skip the corrupt data in SELECT statement?
How can I get more information e.g. the id-field of the corrupt data to skip it in where-clause?
Thanks!

JSON to HIVE ingestion

add jar /path to/hive-serdes-1.0-SNAPSHOT.jar;
CREATE EXTERNAL TABLE student
( id int, student_id INT, type STRING, score DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{ "id":"_id",
"student_id":"student_id", "type":"type","score":"score" }' )
TBLPROPERTIES('mongo.uri'='mongodb://****---****.nam.nsroot.net:*****/admin.student');
I am able to run the code successfully and ingest data, but the "id" field gets populated as NULL.
Should I change the data type? I tried STRING as well and got the same result.
According to the mongo-hadoop Hive SerDe, ObjectId corresponds to a special instance of STRUCT.
A Hive field corresponding to an ObjectId must be a STRUCT with the fields oid, a STRING, and bsontype, an INT, and nothing else. The oid is the string of the ObjectId, while the bsontype should always be 8. Per your example, it should be:
CREATE EXTERNAL TABLE student
(id STRUCT<oid:STRING, bsontype:INT>, student_id INT, type STRING, score DOUBLE)
Where the output would be something similar to:
{"oid":"56d6e0f6ff1f17f74ebbc16c","bsontype":8}
{"oid":"56d6e0f8ff1f17f74ebbc16d","bsontype":8}
...
The above was tested with: MongoDB v3.2.x, mongo-java-driver-3.2.2.jar, mongo-hadoop-core-1.5.0-rc0.jar, mongo-hadoop-hive-1.5.0-rc0.jar.
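If you then need the plain ObjectId string in queries, a hedged example of reading it back out of the struct with ordinary Hive dot notation (column names as in the table above):
-- id.oid holds the ObjectId text; bsontype stays 8
SELECT id.oid AS object_id, student_id, `type`, score
FROM student;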

How can I parse a Json column of a Hive table using a Json serde?

I am trying to load de-serialized json events into different tables, based on the name of the event.
Right now I have all the events in the same table; the table has only two columns, EventName and Payload (the payload stores the JSON representation of the event):
CREATE TABLE event( EventName STRING, Payload STRING)
So basically what I want is to load the data into the following table:
CREATE TABLE TempEvent ( Column1 STRING, Column2 STRING, Column3 STRING )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
And load the events with something like:
INSERT INTO TempEvent select Payload from event where EventName='TempEvent';
But Hive is throwing an exception saying that the destination table has 3 columns and the select statement just 1.
Is there another way to accomplish this, or am I doing something wrong?
The JSON SerDe requires a table with one JSON document per line in order to use it. So it won't work with your input table, because the line
TempEvent, {"Column1":"value1","Column2":"value2","Column3":"value3"}
is not valid JSON. So first you need to move it into a new intermediate table that just contains valid JSON, and then populate the JSON SerDe table from there using load data:
create table event_json (Payload string)
stored as textfile;
insert into table event_json
select Payload from event
where EventName='TempEvent';
create table TempEvent (Column1 string, Column2 string, Column3 string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
load data inpath '/user/hive/warehouse/event_json' overwrite into table TempEvent;
Then you can extract the columns like this:
select Column1, Column2, Column3
from TempEvent;
Of course, all of this processing would not be necessary if your source table were valid JSON originally; you could just create the TempEvent table as an external table and pull data directly from it.
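To illustrate that last point, a hedged sketch of the external-table variant; the location path is a placeholder, and it assumes the raw files contain one valid JSON document per line:
CREATE EXTERNAL TABLE TempEvent (Column1 STRING, Column2 STRING, Column3 STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/temp_event/';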

how to parse json using json_populate_recordset in postgres

I have JSON stored as text in one of my database rows. The JSON data is as follows:
[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},{"id":67273,"name":"16167.txt"},{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]
To parse this I want to use the PostgreSQL function
json_populate_recordset()
When I run a command like
select json_populate_recordset(null::json,'[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},{"id":67273,"name":"16167.txt"},{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]') from anoop;
it gives me the following error:
first argument of json_populate_recordset must be a row type
Note: in the FROM clause, "anoop" is the table name.
Can anyone suggest how to use the json_populate_recordset function to extract data from this JSON string?
I got the function's reference from
http://www.postgresql.org/docs/9.3/static/functions-json.html
The first argument passed to the PostgreSQL function json_populate_recordset should be a row type. If you want to use the JSON array to populate the existing table anoop, you can simply pass the table anoop as the row type, like this:
insert into anoop
select * from json_populate_recordset(null::anoop,
'[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},
{"id":67273,"name":"16167.txt"},
{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]');
Here null supplies the default values for the table columns that are not set in the JSON passed.
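A small hedged illustration of that point, assuming anoop has only the columns id and name: a key missing from the JSON is taken from the base row (null here), so it comes back as NULL.
select * from json_populate_recordset(null::anoop, '[{"id":67275}]');
--   id   | name
-- -------+------
--  67275 |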
If you don't have an existing table, you need to create a row type to hold your JSON data (i.e. column names and their types) and pass it as the first parameter, like this anoop_type:
create TYPE anoop_type AS (id int, name varchar(100));
select * from json_populate_recordset(null :: anoop_type,
'[...]') --same as above
Alternatively, there is no need to create a new type at all; you can pass null::record and supply a column definition list instead:
select * from json_populate_recordset(null::record,'[{"id_item":1,"id_menu":"34"},{"id_item":2,"id_menu":"35"}]')
AS
(
id_item int
, id_menu int
)
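The same approach applied to the JSON from the question (a hedged sketch; the column definition list just has to match the keys id and name):
select * from json_populate_recordset(null::record,
 '[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},
   {"id":67273,"name":"16167.txt"},
   {"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]')
AS
(
  id int
  , name varchar(100)
);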