I'm trying to load, filter and unload some json files using AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS
json_based_table(file_line string)
LOCATION 's3://<my_bucket>/<some_path>/';
UNLOAD
(SELECT file_line from json_based_table limit 10)
TO 's3://<results_bucket>/samples/'
WITH (format = 'JSON');
The problem is that the output is a set of files containing one JSON object per line, each with a single key "file_line" whose value is a line from the original file as a string.
How do I UNLOAD only the values of such a table (ignoring the column name I had to create in order to load the files)?
It seems that by choosing
WITH (format = 'TEXTFILE');
I can get what I want.
Choosing JSON as the format is good for preserving the tabular structure of the table in a file; in this case the name was simply misleading.
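For reference, a minimal sketch of the full working pair (bucket names are placeholders, as in the question); TEXTFILE writes the bare column values, one per line:
CREATE EXTERNAL TABLE IF NOT EXISTS
json_based_table(file_line string)
LOCATION 's3://<my_bucket>/<some_path>/';
UNLOAD
(SELECT file_line FROM json_based_table LIMIT 10)
TO 's3://<results_bucket>/samples/'
WITH (format = 'TEXTFILE');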
I am loading data from a source using an external table in Snowflake, matching multiple CSV files with a file-name pattern. One of the files has a lot of problems, so I need to load the entire raw data and parse it on my own. How can I load all of the data into a single column?
I was trying to use:
CREATE OR REPLACE EXTERNAL TABLE my_db.public.tb2_434
WITH LOCATION = @mydb.public.blob_tb2_434/
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = 'None' SKIP_HEADER = 1
RECORD_DELIMITER='NONE')
PATTERN='.*.tsv';
to match all files with the .tsv ending and to remove the FIELD_DELIMITER by setting it to None, but this creates a separate JSON record for each data file instead of loading everything as a single column. How can I load all of the files' data into a single column?
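One possible direction (an untested sketch; the stage name is taken from the attempt above, and FIELD_DELIMITER = NONE, written without quotes, is assumed to be the intended setting) is to stop splitting fields and expose each whole record through the external table's VALUE column:
CREATE OR REPLACE EXTERNAL TABLE my_db.public.tb2_434 (
  raw_line VARCHAR AS (VALUE:c1::VARCHAR)  -- hypothetical virtual column holding the entire record
)
WITH LOCATION = @mydb.public.blob_tb2_434/
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = NONE SKIP_HEADER = 1)
PATTERN = '.*\.tsv';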
I have a bunch of json snappy compressed files in HDFS.
They are HADOOP snappy compressed (not python, cf other SO questions)
and have nested structures.
I could not find a method to load them
into Hive (using json_tuple).
Can I get some resources/hints on how to load them?
Previous references (which do not have valid answers):
pyspark how to load compressed snappy file
Hive: parsing JSON
Put all the files in an HDFS folder and create an external table on top of it. If the files have names ending in .snappy, Hive will recognize them automatically. You can specify SNAPPY output compression for writing to a table:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
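For example, with these settings in effect an INSERT into a plain text table produces .snappy output files (a hedged sketch; both table names are hypothetical):
CREATE TABLE snappy_out_tbl (id string, name string) STORED AS TEXTFILE;
INSERT OVERWRITE TABLE snappy_out_tbl
SELECT id, name FROM some_source_tbl;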
CREATE EXTERNAL TABLE mydirectory_tbl(
id string,
name string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
JsonSerDe can parse all complex structures and is much easier to use than json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything inside square brackets [] is an array<>, everything inside {} is a struct<> or map<>, and complex types can be nested. Carefully read the README: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many examples of CREATE TABLE.
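For instance, a minimal sketch (the column names are hypothetical) of how nested JSON maps onto struct and array types with this SerDe:
-- matches lines like: {"id":"1","profile":{"name":"a","tags":["x","y"]}}
CREATE EXTERNAL TABLE mydirectory_nested_tbl (
  id string,
  profile struct<name:string, tags:array<string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';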
If you still want to use json_tuple, create a table with a single STRING column and then parse it with json_tuple at query time. But it is much more difficult.
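A rough sketch of that route (table and field names are hypothetical):
-- raw table: one JSON document per line, kept as a single string column
CREATE EXTERNAL TABLE raw_json_tbl (json_line string)
LOCATION '/mydir';
-- parse selected top-level fields at query time
SELECT t.id, t.name
FROM raw_json_tbl r
LATERAL VIEW json_tuple(r.json_line, 'id', 'name') t AS id, name;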
All JSON records should be on a single line (no newlines inside JSON objects, nor \r). The same is mentioned in https://github.com/rcongiu/Hive-JSON-Serde
If your data is partitioned (e.g. by date):
Create the table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
filename STRING,
cnt BIGINT,
size DOUBLE
) PARTITIONED BY ( `date` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'
Recover the partitions (before the recovery, the table appears to be empty)
MSCK REPAIR TABLE database.table
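After the repair, queries can filter on the partition column as usual (the date value below is just an example):
SELECT filename, cnt, size
FROM database.table
WHERE `date` = '2020-01-01';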
I am trying to load an external JSON file from Azure Blob Storage into Snowflake. I created the table LOCATION_DETAILS with all columns as VARIANT. When I try to load into the table, I get the below error:
Can anyone help me on this?
You need to create a file format and specify the type of file and other specifications like below:
create or replace file format myjsonformat
type = 'JSON'
strip_outer_array = true;
Then try to load the file again; it will work.
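For example, a hedged sketch of the load step (the stage name and file name are assumptions):
copy into LOCATION_DETAILS
  from @my_azure_stage/location_details.json
  file_format = (format_name = myjsonformat);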
When I use external data with Snowflake, I like to create stages linked to the Blob Storage (in this case); it is easy, and you can work with the data transparently, just as if it were local.
Create the stage linked to the blobstorage like this:
CREATE OR REPLACE STAGE "<DATABASE>"."<SCHEMA>"."<STAGE_NAME>"
URL='azure://demostorage178.blob.core.windows.net/democontainer'
CREDENTIALS=(AZURE_SAS_TOKEN='***********************************************')
FILE_FORMAT = (TYPE = JSON);
After that, you can list what is in the Blob Storage from Snowflake like this:
list @"<DATABASE>"."<SCHEMA>"."<STAGE_NAME>";
Or like this:
use database "<DATABASE>";
use schema "<SCHEMA>";
SELECT * FROM @"<STAGE_NAME>"/sales.json;
If you need to create the table, use this:
create or replace table "<DATABASE>"."<SCHEMA>"."<TABLE>" (src VARIANT);
And you can COPY your data like this (for a single file):
copy into "<DATABASE>"."<SCHEMA>"."<TABLE>" from @"<STAGE_NAME>"/sales.json;
Finally, use this for any new data that arrives in your stage. Note: you don't need to erase previous data; already-loaded files are ignored and only the new ones are loaded.
copy into "<DATABASE>"."<SCHEMA>"."<TABLE>" from @"<STAGE_NAME>";
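Once loaded, the VARIANT column can be queried with path notation (the field names below are hypothetical):
select src:storeId::string as store_id,
       src:salesDate::date as sales_date
from "<DATABASE>"."<SCHEMA>"."<TABLE>";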
I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in the documentation (sadly the error log is not informative enough to let me solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some WHERE statements built in? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Because right now I'm building this CREATE statement only to re-use the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, then you can use the SerDe's mapping property (which I saw in the issue "Hive Serde dealing with nested structs"):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.
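One workaround sketch, assuming the table is declared with PARTITIONED BY (dt string) and you are willing to register each date's location explicitly (the date value is only an example):
ALTER TABLE test_schema.test_table
  ADD IF NOT EXISTS PARTITION (dt = '2018-01-15')
  LOCATION 's3://location_name/2018/01/15/appstring/';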
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena
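For instance, a hedged sketch of converting to Parquet with CTAS to reduce the data scanned per query (the output location and target table name are assumptions):
CREATE TABLE test_schema.test_table_parquet
WITH (format = 'PARQUET',
      external_location = 's3://results_bucket/test_table_parquet/') AS
SELECT * FROM test_schema.test_table;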
Actually, I'm sending data to Cosmos via Cygnus. The Cosmos directory where Cygnus puts the data is, for example, /user/myUser/mysetdata. I've created my Hive table with these columns: recvTimeTs, recvTime, entityId, entityType, attrName, attrType, attrValue.
Now I want to put data into Cosmos directly via HttpFS, into the same directory that Cygnus writes to.
What should the ".txt" file format look like? Does it have to be comma delimited? For example:
recvTimeTs;recvTime;entityId;entityType;attrName;attrType;attrValue
value;value;value;...
Hive tables contain the structured data within files located in the HDFS folder given in the Hive table creation command.
With Cygnus 0.1, such structured data is achieved by using CSV-like files, so adding a new file to the HDFS folder, or appending new data to an already existing file within that folder, is as easy as composing new CSV-like lines of data. The separator character must be the same one you specified when creating the table, e.g.:
create external table <table_name> (
  recvTimeTs bigint,
  recvTime string,
  entityId string,
  entityType string,
  attrName string,
  attrType string,
  attrValue string
)
row format delimited fields terminated by '|'
location '/user/<myusername>/<mydataset>/';
Thus, with | as the example separator, the new data lines must look like:
<ts>|<ts_ms>|<entity_name>|<entity_type>|<attribute_name>|<attribute_type>|<value>
From Cygnus 0.2 (inclusive), the structured data is achieved by using JSON-like files. In this case you do not have to deal with separators, nor with table creation (see this question), since JSON does not use separators and the table creation is automatic. You compose a new file, or new data to be appended to an already existing file, following one of these formats (depending on whether you are storing the data in row or column mode, respectively):
{"recvTimeTs":"13453464536", "recvTime":"2014-02-27T14:46:21", "entityId":"Room1", "entityType":"Room", "attrName":"temperature", "attrType":"centigrade", "attrValue":"26.5", "attrMd":[{name:ID, type:string, value:ground}]}
{"recvTime":"2014-02-27T14:46:21", "temperature":"26.5", "temperature_md":[{"name":"ID", "type":"string", "value":"ground"}]}
It is worth mentioning that there exist scripts in charge of migrating data in the 0.1-like format to the 0.2-like (or higher) format.