I have to load JSON data into Hive. The problem is that one field's name is a date, and it differs per record, which leads to all kinds of problems. The DDL for one record looks like:
CREATE EXTERNAL TABLE `not_really_awesome_table` (
  `super_wtf` struct<
    `10-02-2019`: string
  >,
  `super_blah` struct<
    `bleh`: string,
    `blah`: string,
    `sub_blah`: struct<
      `blah_field`: string,
      `bleh_field`: string
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION
  's3://wtf/is/this/lol'
TBLPROPERTIES (
  'has_encrypted_data' = 'false',
  'transient_lastDdlTime' = '1539066055')
;
Is there a way to ignore the super_wtf field, or cast it to some type that avoids parsing it further?
You can skip the super_wtf column in the DDL and declare everything else:
CREATE EXTERNAL TABLE `not_really_awesome_table` (
  `super_blah` struct<
    `bleh`: string,
    `blah`: string,
    `sub_blah`: struct<
      `blah_field`: string,
      `bleh_field`: string
    >
  >
)
In that case the field will simply not be parsed from the JSON.
Alternatively, define the super_wtf column as map<string, string> in the DDL, so the per-record date becomes a map key instead of a column name, as sketched below.
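A minimal sketch of the map variant (the rest of the DDL is unchanged from your original; only the super_wtf line differs):
CREATE EXTERNAL TABLE `not_really_awesome_table` (
  `super_wtf` map<string, string>,
  `super_blah` struct<
    `bleh`: string,
    `blah`: string,
    `sub_blah`: struct<
      `blah_field`: string,
      `bleh_field`: string
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://wtf/is/this/lol';

-- The per-record date then shows up as a map key rather than a column name:
SELECT map_keys(super_wtf) FROM not_really_awesome_table;  -- e.g. ["10-02-2019"]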
I'm having trouble understanding why a ResultSet from an executed query will not continue to advance when there are clearly more rows to iterate over. I have a table called all_projects, and when I run this query:
SELECT project_title, created_date, isActive FROM all_projects WHERE project_lead='myUser' ORDER BY created_date DESC;
in my psql shell, I get several rows back, as expected. The column data types are String, Timestamp, and Boolean.
For each row I'm attempting to build an Array[String], ultimately producing a List[Array[String]].
When I iterate over this with resultSet.next(), it retrieves the first two values, but when I call next() after acquiring the first timestamp value, it fails and returns false.
Below is my code, riddled with println debug statements to see what happens; the resulting output and stack trace are at the bottom.
def getAll(userName: String, db: Database): List[Array[String]] = {
  val tablesQuery = "SELECT project_title, created_date, isActive FROM all_projects WHERE project_lead=? ORDER BY created_date DESC;"
  var returnResult = new ListBuffer[Array[String]]
  db.withConnection { conn =>
    val ps = conn.prepareStatement(tablesQuery)
    ps.setString(1, userName)
    val qryResult = ps.executeQuery()
    val columnCount = qryResult.getMetaData.getColumnCount
    println("RETRIEVED THIS MANY COLUMNS: " + columnCount)
    while (qryResult.next()) {
      println("Achieved next in while loop >>>>>>>>>>>")
      val row = new Array[String](columnCount)
      for (i <- 0 to columnCount - 1) {
        println(s"Inserting into Array($i) from the column index(${i + 1})")
        if (i < 2) {
          println("Tried to get string: ")
          row(i) = qryResult.getString(i + 1)
        } else { // note I also just tried to keep it all as .getString()
          println("Tried to get boolean: ")
          row(i) = qryResult.getBoolean(i + 1).toString
        } // retrieve by column index; SQL columns start at 1
        println("Row before we move on: " + row.mkString(", "))
        if (i <= columnCount - 2) {
          println("Called next? -> " + qryResult.next())
        }
      }
      returnResult += row
    }
  }
  returnResult.toList
}
And here is the resulting output, which should have been fine, but as you can see, next() returns false when it is called while the cursor is on the first timestamp value.
RETRIEVED THIS MANY COLUMNS: 3
Achieved next in while loop >>>>>>>>>>>
Inserting into Array(0) from the column index(1)
Tried to get string:
Row before we move on: Wild Elephants of Mexico, null, null
Called next? -> true
Inserting into Array(1) from the column index(2)
Tried to get string:
Row before we move on: Wild Elephants of Mexico, 2017-08-05 11:00:44.078232, null
Called next? -> false
Inserting into Array(2) from the column index(3)
Tried to get boolean:
[error] o.j.ResultSetLogger - java.sql.ResultSet.getBoolean:
throws exception: org.postgresql.util.PSQLException: ResultSet not positioned properly, perhaps you need to call next.
org.postgresql.util.PSQLException: ResultSet not positioned properly, perhaps you need to call next.
What is happening here?
You are calling qryResult.next() within what is meant to be the handling of a single row, which breaks everything. (Boy, are you making your life much too difficult.)
A ResultSet represents a query result as a movable cursor pointing at a single row. You handle rows one at a time, and call next() only when you have fully finished handling the current row.
Let's make things much much simpler. (I'm just writing this out in a web page, don't have time to check or compile it, so please excuse my boo-boos.)
def handleRow(qryResult: ResultSet, columnCount: Int): Array[String] = {
  (1 to columnCount).map(i => qryResult.getString(i)).toArray
}

def getAll(userName: String, db: Database): List[Array[String]] = {
  val tablesQuery = "SELECT project_title, created_date, isActive FROM all_projects WHERE project_lead=? ORDER BY created_date DESC;"
  db.withConnection { conn =>
    val ps = conn.prepareStatement(tablesQuery)
    ps.setString(1, userName)
    val qryResult = ps.executeQuery()
    val columnCount = qryResult.getMetaData.getColumnCount
    println("RETRIEVED THIS MANY COLUMNS: " + columnCount)
    val buffer = new ListBuffer[Array[String]]
    while (qryResult.next()) {
      buffer += handleRow(qryResult, columnCount)
    }
    buffer.toList
  }
}
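A usage sketch (the caller below is hypothetical); note that getString also works for the timestamp and boolean columns, since the driver converts those values to their text form:
// Hypothetical caller; `db` is the same Database handle passed in above.
val rows: List[Array[String]] = getAll("myUser", db)
rows.foreach(row => println(row.mkString(" | ")))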
I have a Hive table that loads JSON data. There are two values in my JSON, both of which arrive as strings. If I declare their columns as bigint, then a SELECT on this table gives the error below:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
at [Source: java.io.ByteArrayInputStream@3b6c740b; line: 1, column: 21]
If I change them to string, it works fine.
But now, because these columns are strings, I cannot use the from_unixtime function on them.
If I try to alter these columns' data types from string to bigint, I get the error below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions : uploadtimestamp
Below is my create table statement:
create table ABC
(
  uploadTimeStamp bigint
  ,PDID string
  ,data array
  <
    struct
    <
      Data:struct
      <
        unit:string
        ,value:string
        ,heading:string
        ,loc:string
        ,loc1:string
        ,loc2:string
        ,loc3:string
        ,speed:string
        ,xvalue:string
        ,yvalue:string
        ,zvalue:string
      >
      ,Event:string
      ,PDID:string
      ,`Timestamp`:string
      ,Timezone:string
      ,Version:string
      ,pii:struct<dummy:string>
    >
  >
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
My JSON:
{"uploadTimeStamp":"1488793268598","PDID":"123","data":[{"Data":{"unit":"rpm","value":"100"},"EventID":"E1","PDID":"123","Timestamp":1488793268598,"Timezone":330,"Version":"1.0","pii":{}},{"Data":{"heading":"N","loc":"false","loc1":"16.032425","loc2":"80.770587","loc3":"false","speed":"10"},"EventID":"Location","PDID":"skga06031430gedvcl1pdid2367","Timestamp":1488793268598,"Timezone":330,"Version":"1.1","pii":{}},{"Data":{"xvalue":"1.1","yvalue":"1.2","zvalue":"2.2"},"EventID":"AccelerometerInfo","PDID":"skga06031430gedvcl1pdid2367","Timestamp":1488793268598,"Timezone":330,"Version":"1.0","pii":{}},{"EventID":"FuelLevel","Data":{"value":"50","unit":"percentage"},"Version":"1.0","Timestamp":1488793268598,"PDID":"skga06031430gedvcl1pdid2367","Timezone":330},{"Data":{"unit":"kmph","value":"70"},"EventID":"VehicleSpeed","PDID":"skga06031430gedvcl1pdid2367","Timestamp":1488793268598,"Timezone":330,"Version":"1.0","pii":{}}]}
Is there any way I can convert this string unix timestamp to a standard time, or otherwise work with bigint for these columns?
If you are talking about Timestamp and Timezone, then you can define them as int/bigint types.
If you look at their values you'll see that there are no quotes (") around them, therefore they are numeric types within the JSON doc:
"Timestamp":1488793268598,"Timezone":330
create external table myjson
(
  uploadTimeStamp string
  ,PDID string
  ,data array
  <
    struct
    <
      Data:struct
      <
        unit:string
        ,value:string
        ,heading:string
        ,loc3:string
        ,loc:string
        ,loc1:string
        ,loc4:string
        ,speed:string
        ,x:string
        ,y:string
        ,z:string
      >
      ,EventID:string
      ,PDID:string
      ,`Timestamp`:bigint
      ,Timezone:smallint
      ,Version:string
      ,pii:struct<dummy:string>
    >
  >
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location '/tmp/myjson'
;
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| myjson.uploadtimestamp | myjson.pdid | myjson.data |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1486631318873 | 123 | [{"data":{"unit":"rpm","value":"0","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E1","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":"N","loc3":"false","loc":"14.022425","loc1":"78.760587","loc4":"false","speed":"10","x":null,"y":null,"z":null},"eventid":"E2","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.1","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":"1.1","y":"1.2","z":"2.2"},"eventid":"E3","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":"percentage","value":"50","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E4","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":null},{"data":{"unit":"kmph","value":"70","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E5","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}}] |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Even if you have defined Timestamp as a string, you can still cast it to a bigint before using it in a function that requires a bigint.
cast (`Timestamp` as bigint)
hive> with t as (select '0' as `timestamp`) select from_unixtime(`timestamp`) from t;
FAILED: SemanticException [Error 10014]: Line 1:45 Wrong arguments
'timestamp': No matching method for class
org.apache.hadoop.hive.ql.udf.UDFFromUnixTime with (string). Possible
choices: FUNC(bigint) FUNC(bigint, string) FUNC(int)
FUNC(int, string)
hive> with t as (select '0' as `timestamp`) select from_unixtime(cast(`timestamp` as bigint)) from t;
OK
1970-01-01 00:00:00
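One more caveat, as an aside: your sample values (e.g. 1488793268598) look like epoch milliseconds, while from_unixtime expects seconds, so you would divide by 1000 first. A sketch against the myjson table above (the divide-by-1000 step is my assumption based on the 13-digit sample values):
-- div keeps the result a bigint, which from_unixtime requires;
-- plain / would produce a double and fail to match the signature.
select from_unixtime(cast(uploadtimestamp as bigint) div 1000) from myjson;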