Twitter JSON data not getting queried in Hive - json

I am trying to do Twitter sentiment analysis using Flume, Hadoop, and Hive.
I am following this article. I was able to get tweets into HDFS successfully using Flume. This is my Twitter-agent configuration:
#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1
#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<Access-Token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata
#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100
#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/tweets
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=10000
Twitter-agent.sinks.sink1.batchSize=1000
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
Then I created a table just like in the article, and the table was created. When I query the table, it gives an error like this:
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#31228d83; line: 1, column: 2]
Time taken: 0.914 seconds
I tried other queries like select count(id) from tweets, but they also fail with many errors.
This is one of the FlumeData files (tweets) present in HDFS:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable;�#z_�>��<���N ����{"in_reply_to_status_id_str":"613363183034601472","in_reply_to_status_id":613363183034601472,"created_at":"Tue Jun 23 15:09:32 +0000 2015","in_reply_to_user_id_str":"604605328","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":0,"retweeted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":"AlexiBlue","id_str":"613363262760034304","in_reply_to_user_id":604605328,"favorite_count":0,"id":613363262760034304,"text":"#AlexiBlue good morning ☺️","place":null,"lang":"en","favorited":false,"possibly_sensitive":false,"coordinates":null,"truncated":false,"timestamp_ms":"1435072172237","entities":{"urls":[],"hashtags":[],"user_mentions":[{"indices":[0,10],"screen_name":"AlexiBlue","id_str":"604605328","name":"Alexi Blue ★","id":604605328}],"trends":[],"symbols":[]},"contributors":null,"user":{"utc_offset":null,"friends_count":1175,"profile_image_url_https":"https://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","listed_count":6,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","default_profile_image":false,"favourites_count":31695,"description":"PIZZA & TACOS ARE LIFE. #flippinfamily #rudunation #ABNation #5quad #7squad #SamCollinsisbaeaf","created_at":"Sun Mar 09 02:40:15 +0000 2014","is_translator":false,"profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","protected":false,"screen_name":"Sonja_Campbell1","id_str":"2379671544","profile_link_color":"3B94D9","id":2379671544,"geo_enabled":true,"profile_background_color":"C0DEED","lang":"en","profile_sidebar_border_color":"C0DEED","profile_text_color":"333333","verified":false,"profile_image_url":"http://pbs.twimg.com/profile_images/604664190763212800/Nmqxn_p5_normal.jpg","time_zone":null,"url":null,"contributors_enabled":false,"profile_background_tile":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/2379671544/1434956813","statuses_count":17254,"follow_request_sent":null,"followers_count":871,"profile_use_background_image":true,"default_profile":false,"following":null,"name":"Sonita✨","location":"","profile_sidebar_fill_color":"DDEEF6","notifications":null}}
Can anyone help me with this?

SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO, and JSON is one format supported among many. I can see the SerDe exception and JSON in your error message, so it's something to do with the marshalling and unmarshalling of the JSON data in your Hive table columns. Identify which column you are adding the JSON data to. Happy coding.

Download "hive-json-serde.jar" and Add it to hive shell before you query any table which contains SerDe data such json, etc.
You will have to do this everytime you open Hive shell.
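For example (a minimal sketch; the jar path is an assumption and depends on where you downloaded the file):
-- Run at the start of every Hive shell session; the path below is a placeholder.
ADD JAR /home/user/downloads/hive-json-serde.jar;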

You need to download and add hive-serdes-1.0-SNAPSHOT.jar, which contains the JSON SerDe by Cloudera, in your Hive shell. Then you need to create a table based on your required columns.
For example:
create external table load_tweets(id BIGINT, text STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
To perform sentiment analysis, the tweet_id and the tweet text are enough. Now if you run
select * from load_tweets;
then you can see the data in your Hive table, containing the tweet_id and the tweet_text.
You can refer to the link below, in which sentiment analysis is clearly explained with screenshots:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
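As a rough illustration of the approach that link describes (a sketch only; the afinn dictionary table, its columns, and the word-splitting logic are assumptions, not the article's exact code):
-- Assumed: an 'afinn' table (word STRING, rating INT) loaded from the AFINN dictionary.
-- Split each tweet's text into words, join against the dictionary,
-- and average the ratings per tweet to get a crude sentiment score.
SELECT s.id, AVG(d.rating) AS sentiment
FROM (
  SELECT id, word
  FROM load_tweets
  LATERAL VIEW explode(split(lower(text), ' ')) w AS word
) s
JOIN afinn d ON (s.word = d.word)
GROUP BY s.id;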

Related

Hive 3.x causing error for compressed (bz2) json in external table

I have some JSON data (about 60GB) that I have to load into a Hive external table. I am using Hive 3.x with Hadoop 3.x. The schema of the table is as follows:
CREATE TABLE people(a string, liid string, link string, n string, t string, e string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';
I have also loaded the jar for the SerDe as follows:
ADD JAR /usr/hive/lib/hive-hcatalog-core-3.1.2.jar;
If I copy a simple text JSON file (or load it), then DML queries (select etc.) work fine. As the data file is very large, I have compressed it (20GB now). I have loaded this compressed file into the Hive table (created above):
hive> select * from people;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Field name expected
Time taken: 0.096 seconds
hive>
It works fine with uncompressed data. What is the issue with this?
I have tried some solutions like this, but was not successful.
I found the solution myself. Actually, the issue was that two columns in the JSON are arrays. They should be mapped to ARRAY in Hive. The sample I took for the schema did not contain these arrays. Hence, changing the field type to array<string> for one column solved my issue.
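A sketch of the corrected DDL (which column actually holds the array is an assumption here; map whichever JSON fields are arrays in your data):
-- Assumed: 'liid' holds a JSON array of strings in the source data.
CREATE TABLE people(a string, liid array<string>, link string, n string, t string, e string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE LOCATION '/data/db/';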

create Hive table for nested JSON data

I am not able to load nested JSON data into a Hive table. Could someone please help me? Below is what I have tried:
Sample Input:
{"DocId":"ABC","User1":{"Id":1234,"Username":"sam1234","Name":"Sam","ShippingAddress":{"Address1":"123 Main St.","Address2":null,"City":"Durham","State":"NC"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}
On Hive (CDH3):
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;
hive> select * from json_tab;
OK
NULL null
I am getting NULLs here.
I also tried with the HCatalog jar:
ADD JAR /home/training/Desktop/hcatalog-core-0.11.0.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
But I am facing the below error with my create table statement:
FAILED: Error in metadata: Cannot validate serde:
org.apache.hive.hcatalog.data.JsonSerDe FAILED: Execution Error,
return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Could someone please help me? Thanks for your help in advance.
You can use the org.openx.data.jsonserde.JsonSerDe class to read the JSON data.
You can download the jar file from http://www.congiu.net/hive-json-serde/1.3.6-SNAPSHOT/cdh4/
and do the following steps:
add jar /path/to/jar/json-serde-1.3.6-jar-with-dependencies.jar;
CREATE TABLE json_tab(
DocId string,
user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/path/to/data/nested.json' INTO TABLE json_tab;
SELECT DocId, User1.Id, User1.ShippingAddress.City as city,
User1.Orders[0].ItemId as order0id,
User1.Orders[1].ItemId as order1id from json_tab;
Result:
ABC 1234 Durham 6789 4352
I was getting the same exception.
I added the following jars and it worked for me:
ADD JAR /home/cloudera/Data/json-serde-1.3.7.3.jar;
ADD JAR /home/cloudera/Data/hive-hcatalog-core-0.13.0.jar;
Using HiveQL to analyse JSON files requires either org.openx.data.jsonserde.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe to work correctly.
org.apache.hive.hcatalog.data.JsonSerDe
This is the default JSON SerDe from Apache. It is commonly used to process JSON data such as events, which are represented as blocks of JSON-encoded text separated by a new line. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.
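For instance (a minimal sketch; the table name, columns, and location are made up for illustration):
-- Assumed: one JSON event object per line under the given location.
CREATE EXTERNAL TABLE events (id BIGINT, type STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/events/';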
org.openx.data.jsonserde.JsonSerDe
The OpenX JSON SerDe is similar to the native Apache one; however, it offers multiple optional properties such as "ignore.malformed.json", "case.insensitive", and many more. In my opinion, it usually works better when dealing with nested JSON files.
See the working example below:
CREATE EXTERNAL TABLE IF NOT EXISTS `dbname`.`tablename` (
`DocId` STRING,
`User1` STRUCT<
`Id`:INT,
`Username`:STRING,
`Name`:STRING,
`ShippingAddress`:STRUCT<
`Address1`:STRING,
`Address2`:STRING,
`City`:STRING,
`State`:STRING>,
`Orders`:ARRAY<STRUCT<
`ItemId`:INT,
`OrderDate`:STRING>>>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://awsexamplebucket1-logs/AWSLogs/';
Create table statement generated from: https://www.hivetablegenerator.com/
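If some files may contain malformed records, the OpenX SerDe can be told to skip them instead of failing the whole query (a sketch; the table name and columns are placeholders):
-- 'ignore.malformed.json' makes the SerDe return NULL for bad lines
-- rather than aborting the query.
CREATE EXTERNAL TABLE IF NOT EXISTS `dbname`.`tablename_safe` (
`DocId` STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://awsexamplebucket1-logs/AWSLogs/';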

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in the documentation (sadly the error log is not informative enough to solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I build some WHERE statements in? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcomes in terms of billing? Because right now I'm building this CREATE statement only to re-use the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, then you can use the SerDe's mapping property (which I saw in "issue with Hive Serde dealing nested structs"):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data, but that would require a different naming format for your directories.
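For example, a partitioned layout could look like this (a sketch; the directory names, partition keys, and columns are assumptions, not your current layout):
-- Assumed layout: s3://location_name/year=YYYY/month=MM/day=DD/appstring/
CREATE EXTERNAL TABLE logs (field1 string, field2 string)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/';

-- Each partition must then be registered explicitly:
ALTER TABLE logs ADD PARTITION (year='2017', month='01', day='01')
LOCATION 's3://location_name/year=2017/month=01/day=01/appstring/';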
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, use partitioned data to reduce the number of files scanned, or store the data in a column-based format such as Parquet.
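In Athena specifically, a CTAS query can convert the data to Parquet in one step (a sketch; the output location and table names are placeholders):
-- Writes the query result to S3 as Parquet and registers it as a new table.
CREATE TABLE test_table_parquet
WITH (format = 'PARQUET',
      external_location = 's3://folder/parquet-output/')
AS SELECT field1, field2 FROM test_table
WHERE field2 = 'value';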
For examples, see: Analyzing Data in S3 using Amazon Athena

Hive external table with JSON SerDe fetching all NULL values

My data is stored in HDFS in the directory /tmp/kafka/alert in multiple files. Each file contains newline-separated JSON objects like the following:
{"alertHistoryId":123456,"entityId":123,"deviceId":"123","alertTypeId":1,"AlertStartDate":"Dec 28, 2016 12:05:48 PM"}
{"alertHistoryId":123456,"entityId":125,"deviceId":"125","alertTypeId":5,"AlertStartDate":"Dec 28, 2016 11:58:48 AM"}
I added the Hive JSON SerDe jar using the below:
ADD JAR /usr/local/downloads/hive-serdes-1.0-SNAPSHOT.jar;
I created the table with the following:
CREATE EXTERNAL TABLE IF NOT EXISTS my_alert (
alertHistoryId bigint, entityId bigint, deviceId string, alertTypeId int, AlertStartDate string
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/tmp/kafka/alert';
The table was created successfully, but when I fetch data, I get all null values. Does anyone have any idea how to resolve this?
Don't use a SerDe. Adding a jar and converting the records is always overhead. Instead, you can read the JSON using the inbuilt get_json_object and json_tuple functions. If you are looking for an example of how to use them, see this blog: querying-json-records-via-hive
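For example (a sketch against the alert data above; it assumes the raw files are exposed through a one-column table, here called raw_alerts(line STRING), which is not in the original post):
-- get_json_object extracts a single field by JSONPath:
SELECT get_json_object(line, '$.alertHistoryId') FROM raw_alerts;

-- json_tuple extracts several fields in one pass:
SELECT t.alertHistoryId, t.entityId, t.deviceId
FROM raw_alerts
LATERAL VIEW json_tuple(line, 'alertHistoryId', 'entityId', 'deviceId') t
  AS alertHistoryId, entityId, deviceId;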
If you want to use a JSON SerDe only, then have a look at Hive-JSON-Serde. Before you test it out, first validate your JSON with a JSON validator.
You are using an old version of the JSON SerDe. There might be an issue between your JSON SerDe and your Hadoop distribution.
Please find below a link to the new version of the JSON SerDe. Follow the steps from the link to build it according to your Hadoop distribution.
https://github.com/rcongiu/Hive-JSON-Serde
Please see below working example.
hive> add jar /User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
Added [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [/User/User1/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar]
hive> use default;
OK
Time taken: 0.021 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS json_poc (
> alertHistoryId bigint, entityId bigint, deviceId string, alertTypeId int, AlertStartDate string
> )
> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
> LOCATION '/User/User1/sandeep_poc/hive_json';
OK
Time taken: 0.077 seconds
hive> select * from json_poc;
OK
123456 123 123 1 Dec 28, 2016 12:05:48 PM
123456 125 125 5 Dec 28, 2016 11:58:48 AM
Time taken: 0.052 seconds, Fetched: 2 row(s)
hive>
How to build the jar:
Maven should be installed on your PC; then run a command like this:
C:\Users\User1\Downloads\Hive-JSON-Serde-develop\Hive-JSON-Serde-develop>mvn -Phdp23 clean package
In my case I am using HDP 2.3, so I have provided -Phdp23.
Hope this helps if you are willing to use the Hive JSON SerDe.

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc., and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset:
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete the answer of @rahul: you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs), and the "root" config can read from the whole disk but is not writable. The tmp config (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the JSON is nested or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"}, I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that Parquet is a "column" store, so you need columns. JSON isn't by default, so you need to convert it.
The problem is that I have to know my JSON schema, and I have to build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.