I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I've run into a number of problems that have no answer in the documentation (sadly, the error log is not informative enough for me to solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL TABLE statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in subfolders in S3, in the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some built-in WHERE statements? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1 string,
field2 string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Because right now I'm building this CREATE statement only to re-use the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ('paths' = 'device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, you can use the SerDe's mapping property (which I saw in the question "issue with Hive Serde dealing nested structs"):
CREATE EXTERNAL TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot use a wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data, but that would require a different naming format for your directories.
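For reference, a minimal sketch of the partitioned approach (the table name and partition key dt are hypothetical; it reuses the layout from the question):

CREATE EXTERNAL TABLE test_schema.test_table_partitioned (
  device string,
  capacity string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/';

-- Because YYYY/MM/DD is not a Hive-style key=value layout,
-- each date directory must be registered explicitly:
ALTER TABLE test_schema.test_table_partitioned
  ADD PARTITION (dt = '2017-05-01')
  LOCATION 's3://location_name/2017/05/01/appstring/';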
3. Creating tables
You cannot specify a WHERE clause as part of the CREATE TABLE statement.
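The intended flow is to create the table without a filter and apply the WHERE clause at query time, e.g. (field names taken from the question's example):

-- Filtering happens in the SELECT, not in the DDL:
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';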
If your aim is to reduce data costs, use partitioned data to reduce the number of files scanned, or store the data in a columnar format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena
Related
I'm new to Athena, though I have some brief experience with Hive.
I'm trying to create a table from JSON files, which are exports from MongoDB. My problem is that MongoDB uses $oid, $numberInt, $numberDouble and others as internal references, but '$' is not accepted in a column name in Athena.
This is a one-line JSON file that I created as a test:
{"_id":{"$oid":"61f87ebdf655d153709c9e19"}}
and this is the table that refers to it:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`$oid`:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket-name/test/';
When I run a simple SELECT *, it returns this error:
HIVE_METASTORE_ERROR: Error: name expected at the position 7 of
'struct<$oid:string>' but '$' is found. (Service: null; Status Code:
0; Error Code: null; Request ID: null; Proxy: null)
This is related to the fact that the JSON column name contains the $.
Any idea how to handle this situation? My only solution for now is to create a script which cleans the JSON files of the unaccepted characters, but I would really prefer to handle it directly in Athena if possible.
If you switch to the OpenX SerDe, you can create a SerDe mapping for JSON fields with special characters like $ in the name.
See the AWS blog entry Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe, section "Walkthrough: Handling forbidden characters with mappings".
A mapping that would work for your example:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`oid`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.oid"="$oid"
)
LOCATION 's3://bucket-name/test/';
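Once the mapping is in place, the renamed field is queried with the usual struct dot syntax:

SELECT _id.oid FROM landing.json_table;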
I created a Glue crawler to load multiple CSV files from an S3 folder into one table in Athena; all the files have the same CSV format.
I am using a crawler with a CSV classifier for this purpose. But the files have columns with commas and double quotes inside the values, and because the crawler treats commas within a column as separators, the columns are not created properly in the table.
While creating the table manually in Athena, I had the option to specify the SerDe and escape characters in the table definition, as below:
CREATE EXTERNAL TABLE IF NOT EXISTS dump_table (
columns
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION 's3://folder1//source'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1'
);
The problem I am facing is that I am unable to set the escape character in the classifier for the crawler, nor can I provide the SerDe information to the crawler the way I did while creating the table manually.
Could anyone please help me load this CSV data into a table whose columns contain commas within the values?
I have a bunch of JSON Snappy-compressed files in HDFS. They are Hadoop Snappy-compressed (not Python; cf. other SO questions) and have nested structures.
I could not find a method to load them into Hive (using json_tuple). Can I get some resources/hints on how to load them?
Previous references (which do not have valid answers):
pyspark how to load compressed snappy file
Hive: parsing JSON
Put all the files in an HDFS folder and create an external table on top of it. If the files have names like *.snappy, Hive will automatically recognize them. You can specify SNAPPY output format for writing tables:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
CREATE EXTERNAL TABLE mydirectory_tbl(
id string,
name string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' -- this is an HDFS/S3 location
;
JsonSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything in square brackets [] is an array<>, everything in {} is a struct<> or map<>, and complex types can be nested. Carefully read the Readme: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many examples of CREATE TABLE.
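For instance, a minimal sketch of that bracket-to-type mapping (the table name and fields are hypothetical, for records like {"id":"1","tags":["a","b"],"meta":{"src":"x"}}):

CREATE EXTERNAL TABLE nested_tbl (
  id string,              -- simple attribute, mapped as-is
  tags array<string>,     -- [] in the JSON becomes array<>
  meta struct<src:string> -- {} in the JSON becomes struct<>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir_nested';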
If you still want to use json_tuple, then create a table with a single STRING column and parse it with json_tuple at query time, as sketched below. But it is much more difficult.
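A minimal sketch of that json_tuple alternative (the table name and keys are assumptions, matching the columns used above):

-- Raw table: one STRING column per JSON record
CREATE EXTERNAL TABLE raw_json (line STRING)
LOCATION '/mydir_raw';

-- Parse top-level keys at query time
SELECT t.id, t.name
FROM raw_json
LATERAL VIEW json_tuple(raw_json.line, 'id', 'name') t AS id, name;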
All JSON records should be on a single line (no newlines inside JSON objects, and no \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
If your data is partitioned (e.g. by date):
Create the table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
filename STRING,
cnt BIGINT,
size DOUBLE
) PARTITIONED BY (`date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'
Recover the partitions (before the recovery, the table appears to be empty):
MSCK REPAIR TABLE database.table
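After the repair, partition pruning works as usual, e.g. (the date value is hypothetical):

SELECT filename, cnt
FROM database.table
WHERE `date` = '2017-05-01';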
I have thousands of unusually formatted CSVs sitting in S3 that I need uploaded to Redshift.
The CSVs are formatted like so:
Column A Column B ..... Column Z
{"id": 2034823" "created": "2017-1-1" "result": true}
In other words, each row of the CSV is valid JSON.
I've tried a simple COPY command, but to no avail. I tried adding the format as json 'auto'; option, but I'm still receiving errors:
Invalid Value: err_code 1216, line number 1, position 0
Is there a recommended way to handle CSVs in this format? I want to load them into an existing Redshift table that already has types defined.
I have the same exact types of files. These are the steps I followed to load them into a Redshift table:
Create an external table in Redshift Spectrum with a struct column.
Insert into your Redshift table from that external table.
In your case:
1.
CREATE EXTERNAL TABLE <spectrum schema>.<your external table>
(
  data struct<
    id:integer,
    created:timestamp,
    ...
    result:varchar(5)>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'dots.in.keys' = 'true',
  'mapping.requesttime' = 'requesttimestamp')
LOCATION 's3://<your S3 bucket>';
2.
INSERT INTO <your Redshift table>
SELECT data.id, data.created, ..., data.result
FROM <your external table>
See how to set up Redshift Spectrum:
https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
Let me know if you have further questions.
I am working on nested JSON data which I got from Facebook using NiFi. I have created a table in Hive and loaded the data using these commands:
CREATE TABLE abmediaanalysis (
  id string,
  posts struct<
    data:array<struct<
      message:string,
      shares:struct<count:int>,
      id:string,
      reactions:struct<
        data:array<struct<name:string, id:string>>,
        paging:struct<cursors:struct<before:string, after:string>, next:string>>,
      likes:struct<
        data:array<struct<id:string>>,
        paging:struct<cursors:struct<before:string, after:string>, next:string>>>>,
    paging:struct<previous:string, next:string>>,
  feed struct<
    data:array<struct<permalink_url:string, message:string, id:string>>,
    paging:struct<previous:string, next:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/home/10879/facebook1480479682880.json' OVERWRITE INTO TABLE abmediaanalysis;
I have also added the jar file json-serde-1.3.8-jar-with-dependencies.jar.
But when I use LATERAL VIEW explode to print all the columns, I get a Java heap size error. I have also increased the heap size, but I still get the same error:
select id,posts_message,posts_share_count,posts_id,feed_data_permalink_url,feed_data_message,feed_data_id,reaction_data_name,reaction_data_id,posts_likes_data_id from abmediaanalysis
LATERAL VIEW explode(posts.data.message)MSG as posts_message
LATERAL VIEW explode(posts.data.shares.count)CT as posts_share_count
LATERAL VIEW explode(posts.data.id) I as posts_id
LATERAL VIEW explode(feed.data.permalink_url) PU as feed_data_permalink_url
LATERAL VIEW explode(feed.data.message) MSG as feed_data_message
LATERAL VIEW explode(feed.data.id) I as feed_data_id
LATERAL VIEW explode(posts.data.reactions) NM as posts_reactions_name
LATERAL VIEW explode(posts_reactions_name.data.name) NM as reaction_data_name
LATERAL VIEW explode(posts_reactions_name.data.id) NM as reaction_data_id
LATERAL VIEW explode(posts.data.likes) I as likes_data_id
LATERAL VIEW explode(likes_data_id.data.id) I as posts_likes_data_id;
When I tried to print two or three columns, instead of showing 616 records it shows approximately 15,625 records.
Can anyone help with this issue?
Is there a chance to load the above JSON data directly from NiFi into a Hive table? If so, can you tell me how?
Thanks in advance.
If I may suggest another approach: just load the whole JSON string into a single String-type column of an external table, e.g.:
CREATE EXTERNAL TABLE json_data_table (
id String,
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';
Use Hive's get_json_object to extract individual columns. E.g., if the json_data column holds the JSON string below:
{"store":
{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy#only_for_json_udf_test.net",
"owner":"amy"
}
then this query:
SELECT get_json_object(json_data, '$.owner') FROM json_data_table;
returns amy.
In this way you can extract each JSON element as a column from the table.
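For instance, nested values and array elements from the sample above can be reached with JSONPath expressions (a short sketch):

SELECT
  get_json_object(json_data, '$.store.bicycle.price'),  -- returns 19.95
  get_json_object(json_data, '$.store.fruit[0].type')   -- returns apple
FROM json_data_table;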