Loading a JSON Array File in Hive

I have a file that contains the following data:
[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]
I am trying to load this file into a Hive table.
I am using the following Hive SerDe:
CREATE EXTERNAL TABLE my_table (
my_array ARRAY<struct<col1:string,col2:int>>
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
I get the following error when I run select * after running the create command:
['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:499', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:307', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:878', 'sun.reflect.GeneratedMethodAccessor29:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy35:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:559', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:751', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:147', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2207', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:494', '*org.apache.hadoop.hive.serde2.SerDeException:java.io.IOException: Start token not found where expected:30:1', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:184', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:502', '*java.io.IOException:Start token not found where expected:30:0', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:170'], statusCode=3), results=None, hasMoreRows=None)
I tried several things, none of which worked as expected. I can't change the input data format as it is someone else who is providing the data.

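The file itself is valid JSON, but it is not in the layout the Hive JsonSerDe expects. The SerDe reads the file line by line and expects each line to be one complete JSON object, starting with "{", whose keys match the table's column names. Your file starts with "[" and spreads the array over several lines, which is what "Start token not found where expected" is complaining about. So change your JSON file to look something like below.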
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Create your table in the exact same way as you are doing it already.
CREATE EXTERNAL TABLE my_table
(
my_array ARRAY<struct<col1:string,col2:int>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
Now run a select * on your newly created table to see the following results.
[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]
Use select my_array.col1 from my_table; to see the values for col1 from your array.
["col1","col11","col111"]
PS - Not the most efficient way to store the data. Consider transforming the data and storing it as ORC/Parquet.
Hope that helps!
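If that rewrite has to be done programmatically, a minimal Python sketch could look like the following. It assumes the whole file fits in memory; the function name wrap_array and the sample variable are only illustrative, and the key has to match the column name used in the table (my_array here).

```python
import json

def wrap_array(text, key="my_array"):
    """Wrap a top-level JSON array in an object and emit it on a single line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep the whole record on one physical line.
    return json.dumps({key: records}, separators=(",", ":"))

sample = '''[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]'''

print(wrap_array(sample))
```

The printed line is the single-line object shown above, which can then be written to the table's location.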

It looks like the issue is with your JSON data. Can you try the example below?
Create an employee JSON file with the content below and place it in HDFS.
[root@quickstart spark]# hadoop fs -cat /user/cloudera/spark/employeejson/*
{"Name":"Vinayak","age":35}
{"Name":"Nilesh","age":37}
{"Name":"Raju","age":30}
{"Name":"Karthik","age":28}
{"Name":"Shreshta","age":1}
{"Name":"Siddhish","age":2}
Add the jar below (run this only if you get an error):
hive> ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;
hive>
CREATE TABLE employeefromjson(name string, age int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/hive/employeefromjson'
;
hive> LOAD DATA INPATH '/user/cloudera/spark/employeejson' OVERWRITE INTO TABLE employeefromjson;
hive> select * from employeefromjson;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.174 seconds, Fetched: 6 row(s)
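If the source file is a JSON array like the one in the question, one way to get to this one-object-per-line layout is a small pre-processing step before the file lands in HDFS. The sketch below is an assumption-laden illustration, not part of Hive: the function name array_to_json_lines is hypothetical and the input is assumed to fit in memory.

```python
import json

def array_to_json_lines(text):
    """Turn a top-level JSON array into one compact JSON object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

sample = '''[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]'''

print(array_to_json_lines(sample))
```

With the data in this shape, the table would declare col1 and col2 as top-level columns, the same way name and age are declared in the employee example.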

Hive's JsonSerDe expects each record to start with '{' and not with '['. That is the problem. The SerDe maps the keys of a JSON object to table columns, as in {"key":"value"}. What you have given in your file is a bare array, that is, a value which does not have any key. So, change your JSON to the format below:
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Your Create table statement should work fine.
If you want to get the data for each column for all the rows, use the below query.
select my_array.col1, my_array.col2 from my_table;
The above command will give you the below result.
OK
["col1","col11","col111"] [1,11,2]
If you want to get the result column-wise for each row separately, use the below query.
select a.* from my_table m lateral view outer inline (m.my_array) a;
The above command will give you the below result.
OK
col1 1
col11 11
col111 2
Hope this helps!

Related

Athena create table from CSV file (S3 bucket) with semicolon

I'm trying to create a table from an S3 bucket that contains a CSV file. Because of the regional settings, the CSV uses a semicolon as the separator, and one column even contains commas.
Input CSV file:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name, First;IT;Inbel
First2 Name2;f2.name2;Name2, First2;IT;Inbel
First3 Name3;f3.name3;Name3, First3;IT;Inbel
First4 Name4;f4.name4;Name4, First4;IT;Inbel
Athena query:
CREATE EXTERNAL TABLE IF NOT EXISTS `a`.`test` (
`Name` string,
`Phone` string,
`CRM` string,
`Desk` string,
`Rol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://***/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
The output comes out as:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name First;IT;Inbel
First2 Name2;f2.name2;Name2 First2;IT;Inbel
First3 Name3;f3.name3;Name3 First3;IT;Inbel
First4 Name4;f4.name4;Name4 First4;IT;Inbel
I tried scanning the web for solutions (especially for the separator), but nothing seems to work. I don't want to change the regional settings and would love to keep the input file as is. Also, if someone knows the solution for the CRM column, that would be a bonus!
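As a sanity check outside Athena, the sketch below parses a few of the sample rows with Python's csv module using ';' as the delimiter. It only illustrates that splitting on semicolons leaves the comma inside the CRM column intact; the helper name parse_semicolon_csv is hypothetical, and the sketch makes no claim about which SerDe properties Athena needs.

```python
import csv
import io

sample = """Name;Phone;CRM;Desk;Rol
First Name;f.name;Name, First;IT;Inbel
First2 Name2;f2.name2;Name2, First2;IT;Inbel
"""

def parse_semicolon_csv(text):
    """Parse semicolon-separated text into a list of dicts keyed by the header row."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]

rows = parse_semicolon_csv(sample)
for row in rows:
    print(row["Name"], "|", row["CRM"])
```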

Loading JSON file in HIVE table

I have a JSON file like the one below, which I want to load into a Hive table in parsed form. What are the possible options?
If it were Avro, I could have used AvroSerDe directly, but the source file in this case is JSON.
{
"subscriberId":"vfd1234-07e1-4054-9b64-83a5a20744db",
"cartId":"1234edswe-6a9c-493c-bcd0-7fb71995beef",
"cartStatus":"default",
"salesChannel":"XYZ",
"accountId":"12345",
"channelNumber":"12",
"timestamp":"Dec 12, 2013 8:30:00 AM",
"promotions":[
{
"promotionId":"NEWID1234",
"promotionContent":{
"has_termsandconditions":[
"TC_NFLMAXDEFAULT16R103578"
],
"sequenceNumber":"305",
"quantity":"1",
"promotionLevel":"basic",
"promotionDuration":"1",
"endDate":"1283142400000",
"description":"Regular Season One Payment",
"active":"true",
"disableInOfferPanel":"true",
"displayInCart":"true",
"type":"promotion",
"frequencyOfCharge":"weekly",
"promotionId":"NEWID1234",
"promotionIndicator":"No",
"shoppingCartTitle":"Regular Season One Payment",
"discountedPrice":"0",
"preselectedInOfferPanel":"false",
"price":"9.99",
"name":"Regular Season One Payment",
"have":[
"CatNFLSundayMax"
],
"ID":"NEWID1234",
"startDate":"1451365600000",
"displayInOfferPanel":"true"
}
}
]
}
I tried to create a table using org.openx.data.jsonserde.JsonSerDe, but it is not showing me the data.
CREATE EXTERNAL TABLE test1
(
SUBSCRIBER_ID string,
CART_ID string,
CART_STAT_NAME string,
SLS_CHAN_NAME string,
ACCOUNT_ID string,
CHAN_NBR string,
TX_TMSTMP string,
PROMOTION ARRAY<STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '<HDFS location where the json file is place in single line>';
I'm not sure which JsonSerDe you are using, but here is one you can use: Hive-JSON-Serde.
hive> add jar /User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
Added [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar]
hive> use default;
OK
Time taken: 0.021 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS json_poc (
> alertHistoryId bigint, entityId bigint, deviceId string, alertTypeId int, AlertStartDate string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/User/User1/sandeep_poc/hive_json';
OK
Time taken: 0.077 seconds
hive> select * from json_poc;
OK
123456 123 123 1 jan 04, 2017 2:46:48 PM
Time taken: 0.052 seconds, Fetched: 1 row(s)
How to build the jar:
Maven should be installed on your PC; then run a command like this.
C:\Users\User1\Downloads\Hive-JSON-Serde-develop\Hive-JSON-Serde-develop>mvn -Phdp23 clean package
-Phdp23 selects the HDP 2.3 profile; replace it with the profile for your Hadoop version.
Or, if you would rather use Hive's built-in JSON functions than a SerDe, look at get_json_object and json_tuple.
If you are looking for an example of how to use it, see this blog: Hive-JSON-Serde example.
I also recommend validating your JSON file: JSON Validator
According to the official documentation, when you are using Hive 0.12 or later, use hive-hcatalog-core.
Note: For Hive releases prior to 0.12, Amazon provides a JSON SerDe available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
You should first add the hive-hcatalog-core jar:
ADD JAR /path/to/jar/;
You can either download it from the Maven repository or find it manually.
Then the Hive table should look like this:
CREATE EXTERNAL TABLE test1
(
SUBSCRIBER_ID string,
CART_ID string,
CART_STAT_NAME string,
SLS_CHAN_NAME string,
ACCOUNT_ID string,
CHAN_NBR string,
TX_TMSTMP string,
PROMOTION ARRAY<STRING>
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '<HDFS location where the json file is place in single line>';
Steps to load JSON file data in hive table
1] Create table in hive
hive> create table JsonTableExample(data string);
2] Load JSON file into a hive table
hive> load data inpath '/home/cloudera/testjson.json' into table JsonTableExample;
3] If we run a plain select * from JsonTableExample; we get every row back as one raw JSON string. That is not very useful, so we have to follow step 4.
4] Select data using get_json_object() function
hive> select get_json_object(data,'$.id') as id,
get_json_object(data,'$.name') as name from JsonTableExample;
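To make the '$.id' style paths in step 4 concrete, here is a rough Python stand-in for the lookup. It is a simplified illustration only: get_path is a hypothetical helper that handles dotted object keys and nothing else (no array indexes or wildcards), and it returns Python values rather than the strings Hive returns.

```python
import json

def get_path(json_text, path):
    """Follow a '$.a.b' style path through nested objects; None if anything is missing."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        node = json.loads(json_text)
    except ValueError:
        return None  # unparseable row behaves like a missing value
    for key in path[1:].split("."):
        if key == "":
            continue  # skips the empty piece before the first dot
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

row = '{"id": 7, "name": "abc", "address": {"city": "Pune"}}'
print(get_path(row, "$.id"))
print(get_path(row, "$.address.city"))
print(get_path(row, "$.missing"))
```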
For many versions of Hive, perhaps the best way to enable JSON processing is using org.apache.hive.hcatalog.data.JsonSerDe as previously mentioned. This is the out-of-the-box capability. However, for some versions of CDH6 and HDP3, there is a new feature where JSON is a first-class citizen. This exists in Apache Hive 4.0 and higher.
CREATE TABLE ... STORED AS JSONFILE;
Please note that each JSON object must be on its own line (without line breaks).
{"name":"john","age":30}
{"name":"sue","age":32}
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
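Before pointing a table at a file, it can help to confirm that every line really is a standalone JSON object. The following Python sketch is one way to do that; find_bad_lines is a hypothetical helper name, and the check is deliberately strict (blank lines are skipped, anything that is not a JSON object on its own line is reported).

```python
import json

def find_bad_lines(text):
    """Return 1-based numbers of lines that are not a complete JSON object."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            value = json.loads(line)
        except ValueError:
            bad.append(number)
            continue
        if not isinstance(value, dict):
            bad.append(number)  # valid JSON, but not an object
    return bad

one_per_line = '{"name":"john","age":30}\n{"name":"sue","age":32}'
pretty_printed = '{\n"name":"john",\n"age":30\n}'

print(find_bad_lines(one_per_line))
print(find_bad_lines(pretty_printed))
```

A pretty-printed object fails on every line, which is the usual reason a JSON SerDe table shows errors or no data for a file that a JSON validator accepts.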

Hive json serde - Keys with white spaces not populating

I have an external table that is built off of a json file.
All of the json keys are columns and are populated as expected except for one key that has a space.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( event ARRAY<STRUCT<
value:STRING
,info:STRUCT
<id:STRING
,event_source:STRING>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("mapping.event_source"="event source")
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'foo/bar'
All of the values show up as expected except for event_source, which shows up as NULL. The original form of event_source in the json file is 'event source' without the single quotes.
Is there something I need to do different with the WITH SERDEPROPERTIES setting in order to get the key to work properly?
Thanks
Do you mean that the JSON has data like this?
{ id: "myid", event source: "eventsource" }
If so, there's not much that can be done since it's simply broken JSON.
If not, can you post a sample of the JSON you're trying to read ?
I have encountered a similar problem as above but with a slight variation that the input data is correct json.
I have an external table that is built off of a json file. All of the json keys are populated except one msrp_currency
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( id string,
variants array<struct<pid:string, msrp_currency:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true" ,
'mapping.variants.msrp_currency' = 'variants.msrpcurrency')
LOCATION 'foo/bar'
All of the values show up as expected except for msrp_currency, which shows up as NULL. The reason I need to introduce the underscore is that later I need to extract the same field value as msrpCurrency using the brickhouse to_json UDF.
sample values:
{ "pid": "mypid", "msrpCurrency": "USD" }

issue with Hive Serde dealing nested structs

I am trying to load a large volume of JSON data with a nested structure into Hive using a JSON SerDe. Some of the field names in the nested structure start with $. I am mapping the Hive field names using SERDEPROPERTIES, but when I query the table I get null for the field starting with $. I have tried different syntax, but no luck.
Sample JSON:
{
"_id" : "319FFE15FF90",
"SomeThing" :
{
"$SomeField" : 22,
"AnotherField" : 2112,
"YetAnotherField": 1
}
. . . etc . . . .
Using a schema as follows:
create table testSample
(
`_id` string,
something struct
<
$somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
This schema builds OK; however, somefield (the one starting with $) in the above table always returns null (all the other values exist and are correct).
We've been trying a lot of syntax combinations, but to no avail.
Does anyone know the trick to map a nested field with a leading $ in its name?
You almost got it right. Try creating the table like this.
The mistake you're making is in the column definition. The serde property "mapping.somefield" = "$somefield" says: when looking for the Hive column named 'somefield', look for the JSON field '$somefield'. But in Hive you defined the column with the dollar sign, which, if not outright illegal, is certainly not best practice in Hive.
create table testSample
(
`_id` string,
something struct
<
somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
I tested it with some test data:
{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
I have suddenly started seeing this problem too, but for normal column names (no special characters such as $).
I am populating an external table (Temp) from another internal table (Table2) and want the output of the Temp table in JSON format. I want the column names in camel case in the output JSON file, so I am also using SERDEPROPERTIES in the Temp table to specify the correct names. However, when I do Select * from the Temp table, it gives NULL values for the columns whose names have been used in the mapping.
I am running Hive 0.13. Here are the commands:
Create table command:
CREATE EXTERNAL TABLE Temp (
data STRUCT<
customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customerid' = 'customerId',
'mapping.marketplaceid' = 'marketplaceId'
)
LOCATION '/output';
INSERT INTO TABLE Temp
SELECT
named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin)
FROM Table2;
Select * from Temp:
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}
See how "customerid" and "marketplaceid" are null. Generated JSON file is:
{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}
Now, if I remove the with serdeproperties, the table starts getting all values:
{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}
And then the JSON file so generated is:
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}

Hive: parsing JSON

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
I need these values out of the above JSON:
Country Page impressions_s impressions_o
--------- ----- ------------- --------------
US 2 10 10
There is Hive's json_tuple function, but I am not sure whether it is the best choice.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
You can use get_json_object:
select get_json_object(fieldname, '$.country'),
get_json_object(fieldname, '$.data.ad.impressions.s') from ...
You will get better performance with json_tuple, but I found a "how to" for getting values out of JSON nested inside JSON.
To format your table you can use something like this:
from table t lateral view
explode( split(regexp_replace(get_json_object(ln, '$.data.ad.s'), '\\[|\\]', ''), ',' ) ) tb1 as s
The code above turns your "array" into a column.
For more: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
I hope this helps.
Here is something you can try quickly, although I would suggest using a JSON SerDe.
nano /tmp/hive-parsing-json.json
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
Create base table :
hive > CREATE TABLE hive_parsing_json_table ( json string );
Load json file to Table :
hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
Query the table :
hive > select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
from hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
as Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2
as Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
as Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
as impressions_s,impressions_o;
Output :
v1.country v1.page v4.impressions_s v4.impressions_o
US 227 10 10
You can do this using Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe'). Here are the steps:
ADD JAR /path/to/hive-hcatalog-core.jar;
create a table as below
CREATE TABLE json_serde_nestedjson (
country string,
page int,
data struct < ad: struct < impressions: struct < s:int, o:int > > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
then load data(stored in file)
LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
then get required data using
SELECT country, page, data.ad.impressions.s, data.ad.impressions.o
FROM json_serde_nestedjson;
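To see which dotted paths exist in a nested record before writing the struct definition, a small Python sketch like the one below can help. It is only an illustration outside Hive: flatten is a hypothetical helper that walks nested objects (not arrays) and joins the keys with dots.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON objects into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

row = json.loads('{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}')
for path, value in flatten(row).items():
    print(path, "=", value)
```

The dotted paths it prints line up with the column expressions used in the SELECT, for example data.ad.impressions.s.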
Implementing a SerDe to parse your data in JSON is a better way for your case.
A tutorial on how to implement SerDe for parsing JSON can be found here
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
You can use the following sample SerDe implementation as well
https://github.com/rcongiu/Hive-JSON-Serde