Athena unable to parse date using OpenCSVSerde

I have a very simple csv file on S3
"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"
I'm trying to create a table over this file using the following command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
When I query the table (select * from test) I'm getting an error like this:
HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"
Some more info:
If I change the d column to a string the query will succeed
I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
Definitely seems like a problem with the OpenCSVSerde
The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.

In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:
[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX
format, such as YYYY-MM-DD, as the type LONG.
Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.
Based on the definition of UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation). Each date should be the number of days that have elapsed since January 1, 1970. For example, 2018-01-01 is 17532 days after 1970-01-01.
For instance, your sample CSV should look like this:
"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"
Then you can run that exact same command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test, you will get:
i   d            f       s
--- ------------ ------- ---------------------
1   2018-01-01   1.001   something great!
2   2018-01-02   2.002   something terrible!
3   2018-01-03   3.003   I'm an oil man
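If you need to rewrite an existing CSV, the day-count conversion is simple to script. The following is a minimal Python sketch, not part of the original answer; the helper names to_epoch_days and from_epoch_days are made up for illustration.

```python
from datetime import date

# Day zero of the UNIX epoch.
EPOCH = date(1970, 1, 1)


def to_epoch_days(iso_date: str) -> int:
    """Convert a 'YYYY-MM-DD' string to the number of days since 1970-01-01."""
    return (date.fromisoformat(iso_date) - EPOCH).days


def from_epoch_days(days: int) -> str:
    """Convert a day count since 1970-01-01 back to 'YYYY-MM-DD'."""
    return date.fromordinal(EPOCH.toordinal() + days).isoformat()


if __name__ == "__main__":
    for d in ("2018-01-01", "2018-01-02", "2018-01-03"):
        print(d, to_epoch_days(d))
```

Running it on the three sample dates gives the values used in the rewritten CSV (17532, 17533, 17534).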
An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:
[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the
UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.
It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time as the number of milliseconds that have elapsed since midnight UTC on January 1, 1970.
For instance, consider the following sample CSV:
"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"
And the following CREATE TABLE statement:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
This will be the result set for select * from test:
i   d            f       s                     t
--- ------------ ------- --------------------- -------------------------
1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027
2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027
3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012
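The millisecond values can be produced the same way. This is a minimal Python sketch under the assumption that the timestamps are meant as UTC; the helper names to_epoch_millis and from_epoch_millis are hypothetical. It uses integer timedelta arithmetic rather than floating-point seconds so that no millisecond is lost to rounding.

```python
from datetime import datetime, timedelta

# Naive datetime standing in for 1970-01-01 00:00:00 UTC (assumption: input is UTC).
EPOCH = datetime(1970, 1, 1)
ONE_MS = timedelta(milliseconds=1)


def to_epoch_millis(ts: str) -> int:
    """Convert 'yyyy-mm-dd hh:mm:ss.fff' (UTC) to milliseconds since the epoch."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    # Floor division of two timedeltas yields an exact integer.
    return (parsed - EPOCH) // ONE_MS


def from_epoch_millis(millis: int) -> str:
    """Convert milliseconds since the epoch back to 'yyyy-mm-dd hh:mm:ss.fff'."""
    moment = EPOCH + timedelta(milliseconds=millis)
    # %f prints six digits (microseconds); drop the last three to keep milliseconds.
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


if __name__ == "__main__":
    print(to_epoch_millis("2019-07-28 04:03:58.027"))
    print(from_epoch_millis(1563486638012))
```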

One workaround is to declare the d column as a string and then, in the select query, use DATE(d) or date_parse to convert the value to the date data type.

Related

Can MySQL remove rows from results based on whether a column value changes?

My goal is to create a dataset output that can be easily charted for easy human readability to discover trends.
Data from SELECT * comes out with a string key with the second column in the format:
[{timestamp: 2022-06-11 08:14:29.111-04:00, value: 5},
{timestamp: 2022-06-11 08:34:29.111-04:00, value: 5},
{timestamp: 2022-06-11 09:18:39.111-04:00, value: 3}]
The number of objects in that array will grow over time, and the values need to be tracked, but often will contain duplicate values as this functions as a tracking table for data backup purposes.
My goal is to end up in a format that can be easily read by humans to see changes over time such as:
Key value | timestamp-the-value-changed | value
or
Key Value | timestamp-the-value-first-appeared | last-timestamp-this-value-existed | value
Some of this can be done in JavaScript if needed, and that part I can handle more easily. Can anyone tell me how close I can get to this output without making the MySQL query too complex?
I have already come closer by flattening the dataset using a cross join on the JSON column, but that doesn't aid with the de-duplication.
As a workable example, the data I describe from the cross-join for the original example appears as:
Key | timestamp | value
"StringKey1" | 2022-06-11 08:14:29.111-04:00 | 5
"StringKey1" | 2022-06-11 08:34:29.111-04:00 | 5
"StringKey1" | 2022-06-11 09:18:39.111-04:00 | 3

MS Access query: #Error in Switch Function - Format issue

I'm writing a query that joins two tables, A and B.
A contains below fields:
TAP - Short Text (AAAA, BBBB, etc)
Operator - Short Text
Zone - Short Text (Zone 01, Zone 02..)
B contains below one:
TAP - Short Text
MCC - Number (20210, 20032, etc)
My query is:
SELECT A.TAP, A.Operator, SWITCH(B.MCC='10020', 'Own Network', B.MCC, A.Zone) FROM A LEFT JOIN B ON A.TAP=B.TAP
The query result shows #Error for all Zone values. I think this is because MCC is in Number format: when I change MCC to Short Text (although it contains no text, only numbers), the query returns the correct Zone. But I cannot change it, because I have to use MCC in Number format for other queries.
TAP | Operator | Zone | MCC
AAAA | ATT | Zone 01 | 120001
BBBB | Two | Own Network | 10020
Any suggestion? Many thanks

I'm not really getting that SWITCH statement. The third argument (B.MCC) should evaluate to a boolean, and it doesn't.
Use IIF(Nz(B.MCC, 0)=10020, 'Own Network', A.Zone). I believe that's what you intend to do.
Other problems fixed:
MCC is a number, '10020' is a string. Removed the apostrophes to make 10020 a number.
MCC contained empty (Null) values. Used Nz to convert these values to 0.

As per Erik's suggestion, the issue was resolved by removing the apostrophes around the number and adding Nz to convert empty values to zero.

How to balance out row mode and column mode in cygnus?

I have a weather-station that transmits data each hour. During that hour it makes four recordings (one every 15 minutes). In my current situation I am using attr_persistence=row to store data in my MySql database.
With row mode I get the default generated columns:
recvTimeTs | recvTime | entityId | entityType | attrName | attrType | attrValue | attrMd
But my weather station sends me the following data:
attrName | attrValue
timeRecorded | 14:30:0,22.5.2015
measurement1 | 18.799
measurement2 | 94.0
measurement3 | 1.19
These attrValue are represented in the database as string.
Is there a way to leave the three measurements in row mode and switch the timeRecorded to column mode? And if not, then what is my alternative?
The point of all this is to query the time recorded value, but I cannot query date as long as it is string.
As a side note: having the weather station send the data as soon as it is recorded (every 15 minutes) is out of the question, firstly because I need to conserve battery power and more importantly because in the case of a problem with the post, it will send all of the recordings at once.
So if an entire day went without sending any data, the weather station will send all 24*4 readings at once...

The proposed solution is to use the STR_TO_DATE function of MySQL in order to translate the stored string-based "timeRecorded" attribute into a real MySQL Timestamp type.
Nevertheless, the "timeRecorded" attribute only appears in every fourth row of the table because of the "row" attribute persistence mode of OrionMySQLSink. MySQL has no ROWNUM keyword, so rather than picking every fourth row by position I think you can filter on attrName. The format string also has to match the 14:30:0,22.5.2015 layout that the station sends, something like (not an expert on MySQL):
SELECT STR_TO_DATE(attrValue, '%H:%i:%s,%d.%m.%Y') FROM def_servpath_0004_weatherstation WHERE attrName = 'timeRecorded';
The alternative is to move to "column" mode (in this case you have to provision the table yourself). In this mode you get a single row with all 4 attributes, one of them being "timeRecorded". You can then provision the table by directly specifying the type of the "timeRecorded" column as Timestamp instead of Text. That way you avoid the STR_TO_DATE part.
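If the regrouping has to happen outside MySQL, the same two ideas (parse the string, then collect the attributes that belong to one reading) can be sketched in Python. This is only an illustration under stated assumptions: rows arrive in order as (attrName, attrValue) pairs, each reading starts with its timeRecorded row, the value is laid out as hour:minute:second,day.month.year, and the helper name pivot_rows is hypothetical.

```python
from datetime import datetime

# Assumed layout of the station's time string, e.g. "14:30:0,22.5.2015".
TIME_FORMAT = "%H:%M:%S,%d.%m.%Y"


def pivot_rows(rows):
    """Group row-mode (attrName, attrValue) pairs into one dict per reading.

    A new reading starts whenever a 'timeRecorded' row is seen; the
    measurement rows that follow are attached to it as floats.
    """
    records = []
    for name, value in rows:
        if name == "timeRecorded":
            records.append({"timeRecorded": datetime.strptime(value, TIME_FORMAT)})
        elif records:
            records[-1][name] = float(value)
    return records


if __name__ == "__main__":
    sample = [
        ("timeRecorded", "14:30:0,22.5.2015"),
        ("measurement1", "18.799"),
        ("measurement2", "94.0"),
        ("measurement3", "1.19"),
    ]
    for record in pivot_rows(sample):
        print(record)
```

Each resulting dict corresponds to one row of a "column" mode table, with timeRecorded as a real datetime that can be compared and filtered.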

Apply value specified for date range (from, to) to value in table containing single row per date

First of all, sorry for the title; my English is not good enough to explain the question properly. :)
Let's suppose that we have two tables. The first table, tbl_percents, contains the history of percent values over date ranges. If the to field equals 0000-00-00, it means that the range is still open.
table: tbl_percents
from date
to date
percent int
example content:
2001-01-01 | 2015-01-21 | 10%
2015-01-21 | 0000-00-00 | 20%
Second table is tbl_revenue which contains revenue values for specific date.
table: tbl_revenue
date date
revenue bigint
example content:
2014-01-10 | 10
2015-01-22 | 10
Now we want to apply percent specified in table tbl_percents to revenue. In result we want to get something like this:
2014-01-10 | 1 #because from 2001-01-01 to 2015-01-21 percent = 10%
2015-01-22 | 2 #because from 2015-01-22 till now percent = 20%
Is it possible to get this result in single SQL query?

Yep. You want to do a join using a BETWEEN condition. I have to caution you that these types of queries get very expensive, very fast, so you don't want to do this on a huge dataset. That being said, you can join your tables with something like the following:
SELECT b.revenue, a.percent
FROM tbl_percents AS a
INNER JOIN tbl_revenue AS b
ON b.date BETWEEN a.from_date AND
CASE WHEN a.to_date = DATE("0000-00-00") THEN DATE("2100-01-01")
ELSE a.to_date END
Basically what I'm doing is setting the to_date to something very large and in the future (namely Jan 01, 2100). If the to_date is 0000-00-00, then I apply the very large in the future date. Otherwise I just use the to_date. Using that, I join by revenue date to my percents table where the revenue date is between the percent start date and my modified percent end date.
Again, this is computationally not a good idea on a huge dataset... but for general purposes, it should work just fine. If you start having trouble with speed/performance, I'd suggest trying to apply similar logic using a scripting language like R or Python.
Best of luck!
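To check the join against the sample data from the question, here is a minimal Python sketch that runs the same BETWEEN/CASE logic in an in-memory SQLite database. SQLite is only a stand-in for MySQL here: dates are stored as ISO text, the sentinel is compared as a plain string, and the column names from_date and to_date follow the query above rather than the question. The share column is an addition for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl_percents (from_date TEXT, to_date TEXT, percent INTEGER);
    CREATE TABLE tbl_revenue (date TEXT, revenue INTEGER);
    INSERT INTO tbl_percents VALUES
        ('2001-01-01', '2015-01-21', 10),
        ('2015-01-21', '0000-00-00', 20);
    INSERT INTO tbl_revenue VALUES
        ('2014-01-10', 10),
        ('2015-01-22', 10);
    """
)

# Same join as above: an open-ended range gets a far-future end date.
rows = conn.execute(
    """
    SELECT b.date, b.revenue * a.percent / 100.0 AS share
    FROM tbl_percents AS a
    INNER JOIN tbl_revenue AS b
        ON b.date BETWEEN a.from_date AND
           CASE WHEN a.to_date = '0000-00-00' THEN '2100-01-01'
                ELSE a.to_date END
    ORDER BY b.date
    """
).fetchall()

for row in rows:
    print(row)
```

Note that BETWEEN is inclusive on both ends, so a revenue dated exactly 2015-01-21 would match both ranges in the sample data.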

Something like:
SELECT r.`date`,
       r.revenue * CAST(COALESCE((SELECT p.percent FROM tbl_percents AS p
           WHERE r.`date` >= p.`from`
             AND (r.`date` <= p.`to` OR p.`to` = '0000-00-00')
           ORDER BY p.`from` DESC LIMIT 1), 0) AS DECIMAL(12,2)) / 100 AS MyNewVal
FROM tbl_revenue AS r
I can't test this where I am, but it might get you pointed in a good direction. I think you need to cast your int-stored percent field to decimal to avoid 10/100 == 0, but it seems straightforward otherwise.

Is there a possibility to change the order of a string with numeric value

I have some strings in my database. Some of them have numeric values (but in string format of course). I am displaying those values ordered ascending.
As we know, when values are compared as strings, 10 sorts before 2, which is normal. I am asking if there is any way to display 10 after 2 without changing the code or the database structure, only the data.
If for example I have to display values from 1 to 10, I will have:
1
10
2
3
4
5
6
7
8
9
What I would like to have is
1
2
3
4
5
6
7
8
9
10
Is it possible to add an invisible character or string that will be interpreted as greater than 9? If I put a10 instead of 10, the a10 will be at the end, but is there an invisible or less visible character for that?
So, I repeat, I am not looking for a programming or database structure solution, but for a simple workaround.

You could try casting the value to a number and then ordering by it:
select col
from yourtable
order by cast(col AS UNSIGNED)
See SQL Fiddle with demo

You could try padding the data with the correct number of leading zeroes:
01
02
03
..
10
11
..
99

Since you have a mixture of numbers and letters in this column (even if not in a single row), what you're really trying to do is a natural sort. This is not something MySQL can do natively. There are some workarounds, however. The best I've come across are:
Sort by length then value.
SELECT
mixedColumn
FROM
tableName
ORDER BY
LENGTH(mixedColumn), mixedColumn;
For more examples see: http://www.copterlabs.com/blog/natural-sorting-in-mysql/
Use a secondary column to use as a sort key that would contain some sort of normalized data (i.e. only numbers or only letters).
CREATE TABLE tableName (mixedColumn varchar(10), sortColumn int);
INSERT INTO tableName VALUES ('1',1), ('2',2), ('10',3),
('a',4),('a1',5),('a2',6),('b1',7);
SELECT
mixedColumn
FROM
tableName
ORDER BY
sortColumn;
This could get difficult to maintain unless you can figure out a good way to handle the ordering.
Of course if you were able to go outside of the database you'd be able to use natural sort functions from various programming languages.
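As an illustration of that last option, here is a minimal natural-sort sketch in Python. It is not from the original answer, and the helper name natural_key is made up. It splits each string into alternating text and digit chunks and compares the digit chunks as integers.

```python
import re


def natural_key(text: str):
    """Build a sort key where runs of digits compare numerically.

    re.split with a capturing group always yields text chunks at even
    positions and digit chunks at odd positions, so two keys never
    compare an int against a str.
    """
    chunks = re.split(r"(\d+)", text)
    return [int(c) if c.isdigit() else c.lower() for c in chunks]


if __name__ == "__main__":
    values = ["10", "2", "a2", "1", "a10", "a1", "b1"]
    print(sorted(values))                   # plain string sort
    print(sorted(values, key=natural_key))  # natural sort
```

With the sample values, the natural sort places 10 after 2 and a10 after a2, which a plain string sort does not.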
