Issue with creating Athena tables from CSV files using Glue

I created a Glue crawler to load multiple CSV files from an S3 folder into one Athena table. All the files share the same CSV format.
I am using a crawler with a CSV classifier for this. However, some column values contain commas and double quotes, so the columns are not created correctly in the table: the crawler treats the commas inside a value as separators.
When creating the table manually in Athena, I had the option to specify the SerDe and the escape character in the table definition, as below:
CREATE EXTERNAL TABLE IF NOT EXISTS dump_table (
columns
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'separatorChar'=',')
LOCATION 's3://folder1//source'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1'
);
The problem I am facing is that I cannot specify an escape character for commas in the crawler's classifier, nor can I give the crawler the SerDe information the way I did when creating the table manually.
Could anyone please help me load this CSV data, whose column values contain commas, into a table?
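To make the symptom concrete, here is a minimal Python sketch (not Glue or Athena code) that compares a naive split on commas with a quote-aware parse. The sample line is made up for illustration; it only assumes the files quote values that contain commas.

```python
import csv
import io

# Hypothetical sample line: the second and third values are quoted.
line = '1,"Smith, John","said ""hi"""\n'

# Naive split: every comma is treated as a separator.
naive = line.rstrip("\n").split(",")

# Quote-aware parse: commas inside double quotes stay in the value.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four fields and the quote-aware parse yields three, which is the kind of column shift described in the question.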

Related

How to deal with JSON with special characters in Column Names in AWS ATHENA

I'm new to Athena, although I have some experience with Hive.
I'm trying to create a table from JSON files, which are exports from MongoDB. My problem is that MongoDB uses $oid, $numberInt, $numberDouble and others as internal references, but '$' is not accepted in a column name in Athena.
This is a one line JSON file that I created to test:
{"_id":{"$oid":"61f87ebdf655d153709c9e19"}}
and this is the table that refers to it:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`$oid`:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket-name/test/';
When I run a simple SELECT * it returns this error:
HIVE_METASTORE_ERROR: Error: name expected at the position 7 of
'struct<$oid:string>' but '$' is found. (Service: null; Status Code:
0; Error Code: null; Request ID: null; Proxy: null)
Which is related to the fact that the JSON column contains the $.
Any idea how to handle this situation? My only solution for now is to write a script that "cleans" the unaccepted characters from the JSON file, but I would really prefer to handle it directly in Athena if possible.
If you switch to the OpenX SerDe, you can create a SerDe mapping for JSON fields with special characters like $ in the name.
See the AWS Blog entry "Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe", section "Walkthrough: Handling forbidden characters with mappings".
A mapping that would work for your example:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`oid`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.oid"="$oid"
)
LOCATION 's3://bucket-name/test/';
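For comparison, the cleanup-script fallback mentioned in the question could look roughly like the sketch below. It assumes one JSON object per line and that simply dropping the leading '$' from key names is acceptable; the function names are made up for illustration.

```python
import json

def strip_dollar_keys(value):
    """Recursively rename keys such as "$oid" to "oid"."""
    if isinstance(value, dict):
        return {key.lstrip("$"): strip_dollar_keys(child) for key, child in value.items()}
    if isinstance(value, list):
        return [strip_dollar_keys(item) for item in value]
    return value

def clean_line(line):
    """Parse one JSON line, clean its keys, and write it back on one line."""
    return json.dumps(strip_dollar_keys(json.loads(line)), separators=(",", ":"))

print(clean_line('{"_id":{"$oid":"61f87ebdf655d153709c9e19"}}'))
```

The SerDe mapping avoids this extra pass over the data, which is why it is usually the more convenient option.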

Export non-varchar data to CSV table using Trino (formerly PrestoDB)

I am working on some benchmarks and need to compare ORC, Parquet and CSV formats. I have exported TPC/H (SF1000) to ORC based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other data types (e.g. bigint) to text and store the column types in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable at scale factor SF1000. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain only varchar columns.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. It normally uses the ^A character (ASCII 1) as the field separator, but when the separator is set to ',' the file layout resembles CSV.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be stored correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
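The difference can be reproduced outside Trino with a small Python sketch. It is not what Trino runs internally; it only contrasts a plain delimiter join with the quoting that Python's csv module applies to the same row.

```python
import csv
import io

row = [1, 'A "quote", with comma', "The comment contains a newline\nin it"]

# Plain join: fields are concatenated with ',' and nothing is quoted or escaped.
naive = ",".join(str(field) for field in row)

# csv.writer quotes fields that contain the delimiter, a quote, or a newline,
# and doubles any embedded quote characters.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
quoted = buffer.getvalue()

print(naive)
print(quoted, end="")
```

Only the second form can be read back into three fields by a CSV parser; the first one is ambiguous as soon as a value contains a comma, a quote, or a newline.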

How to load json snappy compressed in HIVE

I have a bunch of Snappy-compressed JSON files in HDFS.
They are Hadoop Snappy compressed (not Python Snappy, cf. other SO questions) and have nested structures.
I could not find a method to load them into Hive (using json_tuple).
Can I get some resources or hints on how to load them?
Previous references (these do not have valid answers):
pyspark how to load compressed snappy file
Hive: parsing JSON
Put all the files in an HDFS folder and create an external table on top of it. If the file names end in .snappy, Hive will recognize the compression automatically. You can also specify Snappy as the output compression for writing to a table:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
CREATE EXTERNAL TABLE mydirectory_tbl(
id string,
name string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
JsonSerDe can parse all complex structures, which is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Anything in square brackets [] is an array<>, and anything in {} is a struct<> or map<>; complex types can be nested. Read the README carefully: https://github.com/rcongiu/Hive-JSON-Serde. It has a section about nested structures and many examples of CREATE TABLE.
If you still want to use json_tuple, create a table with a single STRING column and then parse it using json_tuple. But that is much more difficult.
Each JSON record must be on a single line (no newlines or \r characters inside JSON objects). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
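If the source files are pretty-printed, they have to be rewritten with one record per line before loading. A minimal Python sketch of that step, with a made-up function name and sample records:

```python
import json

def to_json_lines(records):
    """Serialize each record on exactly one line."""
    # json.dumps escapes newlines and carriage returns inside strings,
    # so a record never spans more than one physical line.
    return "\n".join(json.dumps(record) for record in records) + "\n"

records = [
    {"id": "1", "name": "first\nsecond"},  # value contains a real newline
    {"id": "2", "name": "plain"},
]
output = to_json_lines(records)
print(output, end="")
```

The newline inside the first value is written as the two characters \n, so the file still has exactly one line per record.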
If your data is partitioned (e.g. by date):
Create the table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
filename STRING,
cnt BIGINT,
size DOUBLE
) PARTITIONED BY ( `date` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'
Recover the partitions (before the recovery, the table appears to be empty):
MSCK REPAIR TABLE database.table

Different column order in CSV file. CSV header column names as Hive table column names?

I'm receiving the same CSVs, but with different column orders.
My external Hive tables are defined in DDL scripts using the same names as the CSV column names. But it seems that Hive does not map the data by CSV column name when reading the file. Is there any way to make it do that?
drop table db.table;
CREATE EXTERNAL TABLE db.table(`id` string, `text` string, `command` string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar'=',', 'quoteChar'='"','escapeChar'='\\')
stored as textfile
location ' ..... '
tblproperties ("skip.header.line.count"="1");
It would help so much.
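As far as I know, OpenCSVSerde assigns values to columns by position, and the header line is only skipped, not interpreted. One possible workaround is to normalize the column order before the files reach the table location. The Python sketch below assumes every file has a header row with the column names from the DDL; the function name and sample data are made up.

```python
import csv
import io

# Column order used in the CREATE EXTERNAL TABLE statement above.
TABLE_COLUMNS = ["id", "text", "command"]

def reorder_csv(source_text, columns=TABLE_COLUMNS):
    """Rewrite CSV text so its columns follow the table's column order."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Values are looked up by header name, so the input order is irrelevant.
        writer.writerow(row)
    return out.getvalue()

sample = 'command,id,text\nrun,1,"hello, world"\n'
print(reorder_csv(sample), end="")
```

Columns missing from a file come out as empty strings, and columns that are not in the table are dropped.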

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that are not answered in the documentation (sadly, the error log is not informative enough for me to solve them myself):
1. How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
2. My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
3. Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some WHERE clauses built in? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Right now I'm building this CREATE statement only to reuse the data in a SQL query afterwards.
Thanks!
1. JSON field named with parenthesis
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, then you can use the SerDe's mapping property (which I saw in an issue about the Hive SerDe dealing with nested structs):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.
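To illustrate what a different naming format means here, the Python sketch below maps an object key from the question's layout (YYYY/MM/DD/appstring/file) to a Hive-style key=value layout. The partition names app, year, month and day, the function name, and the sample file name are assumptions for illustration only.

```python
def to_hive_style(key):
    """Map 'YYYY/MM/DD/appstring/file' to 'app=appstring/year=YYYY/month=MM/day=DD/file'."""
    # Split into the four directory levels plus the rest of the key.
    year, month, day, app, filename = key.split("/", 4)
    return f"app={app}/year={year}/month={month}/day={day}/{filename}"

print(to_hive_style("2017/03/05/appstring/events.json"))
```

With the app-string as the first partition level, a query that filters on it only has to read the matching prefix.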
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena