Hive external table with JSON SerDe fetching all NULL values

My data is stored in HDFS under the directory /tmp/kafka/alert in multiple files. Each file contains newline-separated JSON objects like the following:
{"alertHistoryId":123456,"entityId":123,"deviceId":"123","alertTypeId":1,"AlertStartDate":"Dec 28, 2016 12:05:48 PM"}
{"alertHistoryId":123456,"entityId":125,"deviceId":"125","alertTypeId":5,"AlertStartDate":"Dec 28, 2016 11:58:48 AM"}
I added the Hive JSON SerDe JAR as follows:
ADD JAR /usr/local/downloads/hive-serdes-1.0-SNAPSHOT.jar;
I created the table with the following statement:
CREATE EXTERNAL TABLE IF NOT EXISTS my_alert (
alertHistoryId bigint, entityId bigint, deviceId string, alertTypeId int, AlertStartDate string
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/tmp/kafka/alert';
The table was created successfully, but when I query it, every column comes back NULL. Does anyone have an idea how to resolve this?

Don't use a SerDe. Adding a JAR and converting the records is always overhead. Instead, you can read the JSON with the built-in get_json_object and json_tuple functions. If you are looking for an example of how to use them, see this blog: querying-json-records-via-hive.
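For instance, you can expose each JSON line as a single string column and extract the fields at query time. A minimal sketch (the staging table name alert_raw is made up; the path is the one from the question, and both functions return values as strings):
CREATE EXTERNAL TABLE IF NOT EXISTS alert_raw (json string)
LOCATION '/tmp/kafka/alert';
-- extract one field per call with get_json_object ...
SELECT get_json_object(json, '$.alertHistoryId') AS alertHistoryId,
       get_json_object(json, '$.deviceId') AS deviceId
FROM alert_raw;
-- ... or several fields in one pass with json_tuple
SELECT t.*
FROM alert_raw
LATERAL VIEW json_tuple(json, 'alertHistoryId', 'entityId', 'deviceId',
                        'alertTypeId', 'AlertStartDate') t
AS alertHistoryId, entityId, deviceId, alertTypeId, AlertStartDate;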
If you do want to use a JSON SerDe, have a look at Hive-JSON-Serde. Before you test it, first run your data through a JSON validator.

You are using an old version of the JSON SerDe; there may be a compatibility issue between your SerDe and your Hadoop distribution.
The link below has a newer version of the JSON SerDe. Follow the steps from the link to build it for your Hadoop distribution.
https://github.com/rcongiu/Hive-JSON-Serde
Please see the working example below.
hive> add jar /User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
Added [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar]
hive> use default;
OK
Time taken: 0.021 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS json_poc (
> alertHistoryId bigint, entityId bigint, deviceId string, alertTypeId int, AlertStartDate string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/User/User1/sandeep_poc/hive_json';
OK
Time taken: 0.077 seconds
hive> select * from json_poc;
OK
123456 123 123 1 Dec 28, 2016 12:05:48 PM
123456 125 125 5 Dec 28, 2016 11:58:48 AM
Time taken: 0.052 seconds, Fetched: 2 row(s)
hive>
How to build the JAR: Maven must be installed on your PC, then run a command like this:
C:\Users\User1\Downloads\Hive-JSON-Serde-develop\Hive-JSON-Serde-develop>mvn -Phdp23 clean package
In my case I am using HDP 2.3, so I passed -Phdp23.
Hope this helps if you want to use the Hive JSON SerDe.

Related

How to deal with JSON with special characters in Column Names in AWS ATHENA

I'm new to Athena, though I have some brief experience with Hive.
I'm trying to create a table from JSON files which are exports from MongoDB. My problem is that MongoDB uses $oid, $numberInt, $numberDouble and others as internal references, but '$' is not accepted in a column name in Athena.
This is a one-line JSON file that I created to test:
{"_id":{"$oid":"61f87ebdf655d153709c9e19"}}
and this is the table that refers to it:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`$oid`:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket-name/test/';
When I run a simple SELECT * it returns this error:
HIVE_METASTORE_ERROR: Error: name expected at the position 7 of
'struct<$oid:string>' but '$' is found. (Service: null; Status Code:
0; Error Code: null; Request ID: null; Proxy: null)
This is related to the fact that the JSON column contains the $.
Any idea how to handle this situation? My only solution so far is a script that cleans the JSON files of the unaccepted characters, but I would really prefer to handle it directly in Athena if possible.
If you switch to the OpenX SerDe, you can create a SerDe mapping for JSON fields with special characters like $ in the name.
See the AWS blog entry Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe, section "Walkthrough: Handling forbidden characters with mappings".
A mapping that would work for your example:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`oid`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.oid"="$oid"
)
LOCATION 's3://bucket-name/test/';
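With that mapping in place, the struct field is queried by its mapped name:
SELECT _id.oid FROM landing.json_table;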

Hive 3.x causing error for compressed (bz2) json in external table

I have some JSON data (about 60 GB) that I have to load into a Hive external table. I am using Hive 3.x with Hadoop 3.x. The schema of the table is as follows:
CREATE TABLE people(a string, liid string, link string, n string, t string, e string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';
I have also loaded the JAR for the SerDe as follows:
ADD JAR /usr/hive/lib/hive-hcatalog-core-3.1.2.jar;
If I copy (or load) a small plain-text JSON file, DML queries (SELECT etc.) work fine. Since the data file is very large, I compressed it with bz2 (20 GB now) and loaded the compressed file into the Hive table created above.
hive> select * from people;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Field name expected
Time taken: 0.096 seconds
hive>
It works fine with uncompressed data. What is the issue here?
I have tried some solutions, like this one, but without success.
I found the solution myself. The actual issue was that two columns in the JSON are arrays; they should be mapped to ARRAY types in Hive. The sample I used to derive the schema did not contain these arrays, hence the mismatch. Changing the field type to array<string> for the affected columns solved my issue.
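For illustration only (the question does not say which two fields hold arrays; picking t and e here is a hypothetical guess), the corrected schema would look like this:
CREATE TABLE people(
  a string, liid string, link string, n string,
  t array<string>,  -- JSON arrays must be declared as ARRAY types,
  e array<string>   -- otherwise the SerDe raises "Field name expected"
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';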

Hive SerDe returns error with JSON tweets Flume

I am collecting Twitter stream data using Flume and storing it in JSON format in HDFS. I am trying to use a Hive SerDe to put this Twitter data into a Hive table, but I am getting a very frustrating error.
hive> ADD JAR file:////home/ubuntu/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
Added [file:////home/ubuntu/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
Added resources: [file:////home/ubuntu/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]
hive> CREATE EXTERNAL TABLE tweet (
> id BIGINT,
> created_at STRING,
> source STRING,
> favorited BOOLEAN,
> text STRING,
> in_reply_to_screen_name STRING
> )
>
> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
> LOCATION '/user/ubuntu/twitter/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
Any help would be appreciated.
I had the same issue; however, I found a workaround to solve the problem:
create table tweets(tweet string);
load data inpath 'home/hduser/test.json' into table tweets;
The only difference is that you will now need to use get_json_object() to access the data, like below:
select get_json_object(tweet,'$.text') as tweet_text, get_json_object(tweet,'$.created_at') as created_at from tweets;
Reference

create Hive table for nested JSON data

I am not able to load nested JSON data into a Hive table. Could someone please help me? Below is what I have tried.
Sample Input:
{"DocId":"ABC","User1":{"Id":1234,"Username":"sam1234","Name":"Sam","ShippingAddress":{"Address1":"123 Main St.","Address2":null,"City":"Durham","State":"NC"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}
On Hive (CDH3):
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;
hive> select * from json_tab;
OK
NULL null
I am getting NULLs here.
Also tried with HCatalog jar:
ADD JAR /home/training/Desktop/hcatalog-core-0.11.0.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
But I am facing the below error with my CREATE TABLE statement:
FAILED: Error in metadata: Cannot validate serde:
org.apache.hive.hcatalog.data.JsonSerDe FAILED: Execution Error,
return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Could someone please help me? Thanks for your help in advance.
You can use the org.openx.data.jsonserde.JsonSerDe class to read the JSON data.
You can download the JAR file from http://www.congiu.net/hive-json-serde/1.3.6-SNAPSHOT/cdh4/
and do the following steps:
add jar /path/to/jar/json-serde-1.3.6-jar-with-dependencies.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/path/to/data/nested.json' INTO TABLE json_tab;
SELECT DocId, User1.Id, User1.ShippingAddress.City as city,
User1.Orders[0].ItemId as order0id,
User1.Orders[1].ItemId as order1id from json_tab;
Result:
ABC 1234 Durham 6789 4352
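If you need one row per order instead of indexing the array positionally, you can explode it with a lateral view (same table as above):
SELECT DocId, o.ItemId, o.OrderDate
FROM json_tab
LATERAL VIEW explode(User1.Orders) ord AS o;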
I was getting the same exception. I added the following JARs and it worked for me:
ADD JAR /home/cloudera/Data/json-serde-1.3.7.3.jar;
ADD JAR /home/cloudera/Data/hive-hcatalog-core-0.13.0.jar;
Using HiveQL to analyse JSON files requires either org.openx.data.jsonserde.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe to work correctly.
org.apache.hive.hcatalog.data.JsonSerDe
This is the default JSON SerDe from Apache. It is commonly used to process JSON data such as events, represented as blocks of JSON-encoded text separated by newlines. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.
org.openx.data.jsonserde.JsonSerDe
The OpenX JSON SerDe is similar to the native Apache one; however, it offers multiple optional properties such as "ignore.malformed.json", "case.insensitive", and many more. In my opinion, it usually works better when dealing with nested JSON files.
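For example, setting "ignore.malformed.json" makes the SerDe return NULLs for unparseable records instead of failing the whole query (the table name and location here are placeholders):
CREATE EXTERNAL TABLE example_json (DocId string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION '/path/to/json';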
See the working example below:
CREATE EXTERNAL TABLE IF NOT EXISTS `dbname`.`tablename` (
`DocId` STRING,
`User1` STRUCT<
`Id`:INT,
`Username`:STRING,
`Name`:STRING,
`ShippingAddress`:STRUCT<
`Address1`:STRING,
`Address2`:STRING,
`City`:STRING,
`State`:STRING>,
`Orders`:ARRAY<STRUCT<
`ItemId`:INT,
`OrderDate`:STRING>>>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/'
Create table statement generated from: https://www.hivetablegenerator.com/

Twitter Json data not getting queried in Hive

I am trying to do Twitter sentiment analysis using Flume, Hadoop, and Hive.
I am following this article. I was able to get tweets into HDFS successfully using Flume. This is my Twitter-agent configuration:
#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1
#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata
#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100
#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
Then I created a table just like in the article, and the table was created. When I query the table, it gives an error like this:
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#31228d83; line: 1, column: 2]
Time taken: 0.914 seconds
I tried other queries, like select count(id) from tweets, but they show a lot of errors.
This is one of the FlumeData files (tweets) present in HDFS:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable;�#z_�>��<���N ����{"in_reply_to_status_id_str":"613363183034601472","in_reply_to_status_id":613363183034601472,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":"604605328","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":"AlexiBlue","id_str":"613363262760034304","in_reply_to_user_id":604605328,"favorite_count":0,"id":613363262760034304,"text":"#AlexiBlue good morning ☺️","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172237","entities":{"urls":[],"hashtags":[],"user_mentions":[{"indices":[0,10],"screen_name":"AlexiBlue","id_str":"604605328","name":"Alexi Blue ★","id":604605328}],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":null,"friends_count":1175,"profile_image_url_https":"https://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","listed_count":6,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","default_profile_image":false,"favourites_count":31695,"description":"PIZZA & TACOS ARE LIFE. #flippinfamily #rudunation #ABNation #5quad #7squad #SamCollinsisbaeaf","created_at":"Sun Mar 09 02:40:15 +0000 2014","is_translator":false,"profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","protected":false,"screen_name":"Sonja_Campbell1","id_str":"2379671544","profile_link_color":"3B94D9","id":2379671544,"geo_enabled":true,"profile_background_color":"C0DEED","lang":"en","profile_sidebar_border_color":"C0DEED","profile_text_color":"333333","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","time_zone":null,"url":null,"contributors_enabled":false,"profile_background_tile":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/2379671544/1434956813","statuses_count":17254,"follow_request_sent":null,"followers_count":871,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"Sonita✨","location":"","profile_sidebar_fill_color":"DDEEF6","notifications":null}}
Can anyone help me with this?
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO, and JSON is one of many supported formats. I can see the SerDe exception and JSON in your error message, so it is something to do with marshalling and unmarshalling the JSON data in the Hive table column. Identify which column you are adding the JSON data to. Happy coding.
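One quick way to check what Hive actually sees is to expose the raw lines as plain strings. In your dump the file starts with a SequenceFile header ("SEQ!org.apache.hadoop.io.LongWritable..."), which is not valid JSON, and 'S' is exactly the unexpected character in the error. A sketch (the table name is made up; adjust the location to your sink's HDFS path):
CREATE EXTERNAL TABLE tweets_raw (line string)
LOCATION '/user/flume/tweets';
-- if the first characters are 'SEQ!', the files are SequenceFiles, not plain JSON text
SELECT substr(line, 1, 100) FROM tweets_raw LIMIT 5;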
Download "hive-json-serde.jar" and Add it to hive shell before you query any table which contains SerDe data such json, etc.
You will have to do this everytime you open Hive shell.
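If you want to avoid re-adding it by hand, you can put the statement in a ~/.hiverc file, which the Hive CLI executes at startup (the JAR path below is an example):
-- contents of ~/.hiverc
ADD JAR /usr/local/downloads/hive-json-serde.jar;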
You need to download and add hive-serdes-1.0-SNAPSHOT.jar to your Hive shell; it contains the JSON SerDe by Cloudera. Then you need to create a table based on your required columns.
For example
create external table load_tweets(id BIGINT,text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/flume/tweets'
To perform sentiment analysis, the tweet id and the tweet text are enough. Now if you run
select * from load_tweets;
then you can see the data in your Hive table, containing the tweet id and the tweet text.
You can refer to the below link in which sentiment analysis has been clearly explained with screen shots.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/