Athena - DATE column correct values from JSON

I have a S3 bucket with many JSON files.
JSON file example:
{"id":"x109pri", "import_date":"2017-11-06"}
The "import_date" field is DATE type in standard format YYYY-MM-DD.
I am creating a Database connection in Athena to link all these JSON files.
However, when I create a new table in Athena and specify this field format as DATE I get: "Internal error" with no other explanation provided. To clarify, the table gets created just fine but if I want to preview it or query, I get this error.
However, when I specify this field as STRING then it works fine.
So the question is: is this a bug, or what is the correct value for the Athena DATE format?

The date column type does not work with certain combinations of SerDe and/or data source.
For example using a DATE column with org.openx.data.jsonserde.JsonSerDe fails, while org.apache.hive.hcatalog.data.JsonSerDe works.
So with the following table definition, querying your JSON will work.
CREATE EXTERNAL TABLE datetest (
  id string,
  import_date date
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket/datetest';
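Once the table is in place, the date column can be queried directly; for example, against the sample record from the question:
SELECT id, import_date
FROM datetest
WHERE import_date = DATE '2017-11-06';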

Related

SQL compilation error: JSON file format can produce one and only one column of type variant or object or array when copying from S3 to Snowflake

I have the following JSON stored in S3:
{"data":"this is a test for firehose"}
I have created the table test_firehose with a varchar column data, and a file_format called JSON with type JSON and the rest left at default values. I want to copy the content from S3 to Snowflake, and I have tried the following statement:
COPY INTO test_firehose
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = 'JSON';
And I receive the error:
SQL compilation error: JSON file format can produce one and only one column of type
variant or object or array. Use CSV file format if you want to load more than one column.
How could I solve this? Thanks
If you want to keep your data as JSON (rather than just as text), then you need to load it into a column with a datatype of VARIANT, not VARCHAR.
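A minimal sketch, assuming a separate table with a single VARIANT column (test_firehose_json is a hypothetical name) and the named file format JSON from the question:
-- Hypothetical target table with one VARIANT column
CREATE OR REPLACE TABLE test_firehose_json (data VARIANT);
-- The S3 URL and the named file format are taken from the question
COPY INTO test_firehose_json
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = (FORMAT_NAME = 'JSON');
-- Pull the "data" key back out of the VARIANT column as text
SELECT data:data::varchar FROM test_firehose_json;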

How to split a column into two columns in SSIS if any invalid data in the column

I am trying to load data from a CSV file and dump it into a database. While reading the date values column from the CSV file, I get an error because the CSV file contains some invalid data like '31-FEB-2014'. So I need to store that invalid data in another column in the table. How can I achieve this using SSIS? Please assist.
Make a new column on your table which is of datatype nvarchar, and map your CSV source column to that new column.
Then afterwards you can do some magic. For example, you could use a Derived Column transformation to handle the new nvarchar value, convert it back to a proper date format, and then map it to your original column (see the SQL sketch below).
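If it is easier to do the split after the load, here is a rough T-SQL sketch of the same idea (the staging table and column names are hypothetical, not from the question): load the CSV date as nvarchar, then let TRY_CONVERT separate valid and invalid values:
-- RawImportDate holds the nvarchar value read from the CSV.
-- TRY_CONVERT returns NULL for values such as '31-FEB-2014', so valid dates
-- land in ImportDate and everything else is kept in InvalidImportDate.
UPDATE dbo.StagingTable
SET ImportDate = TRY_CONVERT(date, RawImportDate),
    InvalidImportDate = CASE
        WHEN TRY_CONVERT(date, RawImportDate) IS NULL THEN RawImportDate
    END;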
Alternatively, you can simply redirect the bad rows: drag the error output (the red arrow) of the source component to another destination and configure its error-handling properties accordingly.
Tag me in case you're stuck.

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I've encountered a number of problems that don't have any answer in the documentation (sadly, the error log is not informative enough for me to solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL TABLE statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the date part of the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some built-in WHERE statements? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Because right now I'm building this CREATE statement only to re-use the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
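With that table in place, the renamed column can be queried as usual; a minimal sketch (the sample value is hypothetical):
-- For a row like {"device":"disk1","Capacity(GB)":"500"}, capacity returns 500
SELECT device, capacity
FROM test_table;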
If you are using nested JSON, then you can use the SerDe's mapping property (which I saw in an issue about the Hive SerDe dealing with nested structs):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data, but that would require a different naming format for your directories.
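As a rough sketch only (it assumes the data were re-organized into hypothetical Hive-style key=value paths such as s3://location_name/appstring/year=2017/month=11/day=06/), a partitioned table could look like this:
CREATE EXTERNAL TABLE test_table_partitioned (
  field1 string,
  field2 string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/appstring/';
-- Register the partitions that follow the key=value convention
MSCK REPAIR TABLE test_table_partitioned;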
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
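Instead, the filter goes into the query you run against the table afterwards, for example:
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';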
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned, or store the data in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena

MySQL to GeoMesa through .csv

I have a MySQL table whose data I have to export to .csv and then ingest that .csv into GeoMesa.
My MySQL table includes a the_geom attribute with data type point, which is stored in the database as a blob.
Now I have two problems:
1. When I export the MySQL data into a .csv file, the csv shows (...) for the the_geom attribute instead of any binary representation or anything that would allow it to be ingested into GeoMesa. How do I overcome this?
2. The csv file also shows # for any attribute with a datetime datatype, but if you expand the column the datetime value can be seen. Will this cause a problem in GeoMesa?
For #1, MySQL's export does not automatically convert the Point datatype into text for you. You might need to call a conversion function such as AsWKT to output the geometry as Well Known Text. The WKT format can be used by GeoMesa to read in the Point data.
For #2, I think you'll need to do the same for the date field. Check out the date and time functions.
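As a sketch covering both points (only the_geom comes from the question; the table name, the id and created_at columns, and the output path are hypothetical):
-- AsWKT() turns the stored point into Well Known Text (ST_AsText() on newer
-- MySQL versions), and DATE_FORMAT() writes the datetime as plain text so the
-- CSV holds the real value rather than a spreadsheet display artifact.
SELECT id,
       AsWKT(the_geom) AS the_geom_wkt,
       DATE_FORMAT(created_at, '%Y-%m-%d %H:%i:%s') AS created_at_text
INTO OUTFILE '/tmp/geomesa_export.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM my_table;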

Importing a CSV with a timestamp field into MonetDB

I'm importing a CSV into MonetDB. I create a table called fx:
CREATE TABLE fx(ticktime timestamp, broker varchar(6), pair varchar(10), side varchar(1), price float, size tinyint, level tinyint);
and now I am trying to upload a large CSV file that does not have a header.
My sample.csv:
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12437,1,1
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12439,5,2
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12441,9,3
My command:
sql>copy into fx from 'c:\fx\sample.csv' using delimiters ',','\n';
Failed to import table line 1 field 1 'timestamp(7)' expected in '20150828 00:00:00.023'
How do I upload this csv?
The timestamp format in your file is not one that MonetDB likes, so there are two options:
1) Change the type of ticktime to string:
CREATE TABLE fx(ticktime string, broker varchar(6),pair varchar(10),side varchar(1),price float,size tinyint,level tinyint);
COPY INTO ...
However, you would then need to convert the string column ticktime to a new column ticktimet of type timestamp using string manipulation, for example:
ALTER TABLE fx ADD COLUMN ticktimet timestamp;
UPDATE fx SET ticktimet = str_to_timestamp(ticktime, '%Y%m%d %H:%M:%S');
Note that this solution will discard the subsecond part (e.g. .023) from the timestamp, as this is currently not supported in str_to_timestamp.
2) Change the CSV to use a date format MonetDB likes, e.g.
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12437,1,1
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12439,5,2
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12441,9,3
Then, COPY INTO should work directly.
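For example, with the reformatted file the command from the question should work unchanged:
COPY INTO fx FROM 'c:\fx\sample.csv' USING DELIMITERS ',','\n';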