Parsing the Yelp JSON dataset in Hive

create external table review
(
  business_id string,
  user_id string,
  stars double,
  text string,
  date date,
  votes struct<vote_type:string, count:int>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
The table is created successfully, but I get an error when querying the stars and date fields, e.g. select stars from review fails.
The dataset comes from the link below and is in JSON format:
https://www.yelp.com/dataset_challenge

You should add a LOCATION clause, such as LOCATION '/user/ruchit31/god/', so that the table points at the directory holding your data. Modify your create table query:
create external table review
(
  business_id string,
  user_id string,
  stars double,
  text string,
  date date,
  votes struct<vote_type:string, count:int>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/path/';
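Once the table points at the right directory, a quick sanity check on the two problematic fields (a hedged sketch; the backticks around date guard against it being treated as a reserved word on newer Hive versions):

select business_id, stars, `date`
from review
limit 10;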

Related

Excel Int to date in Hive

I ingested a CSV table into a Hive table, but the date is shown as an integer value. Is there a way to convert an integer value (stored as a string) to a date in Hive?
When I do this
select cast(day_id2 as date) from table1
...I get null values.
Can someone tell me an elegant way to convert integer values (stored as strings) into date values?
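A hedged sketch, assuming the integers are Excel serial day numbers (as the title suggests): Excel's 1900 date system counts days from a virtual 1899-12-30, so adding the serial to that date recovers the calendar date (serials below 61 come out one day off because of Excel's fictitious 1900-02-29):

-- day_id2 is the column from the question, holding an Excel serial as a string
select date_add('1899-12-30', cast(day_id2 as int)) from table1;

If the integers are instead yyyyMMdd-style values, from_unixtime(unix_timestamp(day_id2, 'yyyyMMdd'), 'yyyy-MM-dd') does the job.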

insert data into table using csv file in HIVE

CREATE TABLE `rk_test22`(
`index` int,
`country` string,
`description` string,
`designation` string,
`points` int,
`price` int,
`province` string,
`region_1` string,
`region_2` string,
`taster_name` string,
`taster_twitter_handle` string,
`title` string,
`variety` string,
`winery` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'input.regex'=',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namever/user/hive/warehouse/robert.db/rk_test22'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'skip.header.line.count'='1',
'totalSize'='52796693',
'transient_lastDdlTime'='1516088117');
I created the hive table using the above command. Now I want to load the following line (in a CSV file) into the table using the load data command. The load data command shows status OK, but I cannot see any data in the table.
0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,#kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
If you are loading a one-line CSV file, then that line is skipped because of this property: 'skip.header.line.count'='1'
Also, the regex should contain one capturing group for each column, as in this answer: https://stackoverflow.com/a/47944328/2700344
And why do you provide these settings in the table DDL:
'COLUMN_STATS_ACCURATE'='true'
'numFiles'='1',
'totalSize'='52796693',
'transient_lastDdlTime'='1516088117'
All these should be set automatically after DDL and ANALYZE.
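As a minimal sketch of a cleaner DDL (hypothetical table name and path): OpenCSVSerde already handles commas inside quoted fields on its own, so the input.regex property, which belongs to RegexSerDe rather than OpenCSVSerde, can simply be dropped. Note that OpenCSVSerde reads every column as STRING regardless of the declared type:

CREATE TABLE rk_test22_fixed (
  `index` string,
  country string,
  description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1');

LOAD DATA LOCAL INPATH '/tmp/winemag.csv' INTO TABLE rk_test22_fixed;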

JSON to HIVE ingestion

add jar /path/to/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE student (
  id INT,
  student_id INT,
  type STRING,
  score DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping'='{"id":"_id", "student_id":"student_id", "type":"type", "score":"score"}'
)
TBLPROPERTIES ('mongo.uri'='mongodb://****---****.nam.nsroot.net:*****/admin.student');
I am able to run this and ingest data successfully, but the "id" field gets populated as NULL.
Should I change the data type? I tried STRING as well and got the same result.
According to the mongo-hadoop Hive SerDe, an ObjectId corresponds to a special instance of STRUCT.
A Hive field corresponding to an ObjectId must be a STRUCT with exactly two fields: oid, a STRING, and bsontype, an INT. The oid is the string form of the ObjectId, while the bsontype should always be 8. Per your example, it should be:
CREATE EXTERNAL TABLE student
(id STRUCT<oid:STRING, bsontype:INT>, student_id INT, type STRING, score DOUBLE)
Where the output would be something similar to:
{"oid":"56d6e0f6ff1f17f74ebbc16c","bsontype":8}
{"oid":"56d6e0f8ff1f17f74ebbc16d","bsontype":8}
...
The above was tested with: MongoDB v3.2.x, mongo-java-driver-3.2.2.jar, mongo-hadoop-core-1.5.0-rc0.jar, mongo-hadoop-hive-1.5.0-rc0.jar.
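With that schema in place, the ObjectId string can be pulled straight out of the struct with dot notation (a hypothetical query for illustration):

select id.oid, student_id, score from student limit 5;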

hive insert into structure data type using a query

I have a use case where I have a table a. I want to select data from it, group by some fields, do some aggregations, and insert the result into another Hive table b that has a struct as one of its columns. I am facing some difficulty with it. Can someone please tell me what's wrong with my queries?
CREATE EXTERNAL TABLE IF NOT EXISTS a (
date string,
acct string,
media string,
id1 string,
val INT
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'folder1/folder2/';
ALTER TABLE a ADD IF NOT EXISTS PARTITION (day='{DATE}') LOCATION 'folder1/folder2/Date={DATE}';
CREATE EXTERNAL TABLE IF NOT EXISTS b (
date string,
acct string,
media string,
st1 STRUCT<id1:STRING, val:INT>
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'path/';
FROM a
INSERT OVERWRITE TABLE b PARTITION (day='{DATE}')
SELECT date,acct,media,named_struct('id1',id1,'val',sum(val))
WHERE day='{DATE}' and media is not null and acct is not null and NOT (id1 = "0" )
GROUP BY date,acct,media,id1;
The error I got:
SemanticException [Error 10044]: Line 3:31 Cannot insert into target table because column number/types are different ''2015-07-16'': Cannot convert column 4 from struct<id1:string,val:bigint> to struct<id1:string,val:int>.
sum() returns a BIGINT, not an INT, so declare
st1 STRUCT<id1:STRING, val:BIGINT>
instead of
st1 STRUCT<id1:STRING, val:INT>
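Alternatively, if the column must stay INT, cast the aggregate inside the struct (a hedged sketch of the same insert):

FROM a
INSERT OVERWRITE TABLE b PARTITION (day='{DATE}')
SELECT date, acct, media, named_struct('id1', id1, 'val', cast(sum(val) as int))
WHERE day='{DATE}' and media is not null and acct is not null and NOT (id1 = "0")
GROUP BY date, acct, media, id1;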

Insert DataFrame into SQL table with AUTO_INCREMENT column

I have a MySQL table which includes a column that is AUTO_INCREMENT:
CREATE TABLE features (
id INT NOT NULL AUTO_INCREMENT,
name CHAR(30),
value DOUBLE PRECISION
);
I created a DataFrame and wanted to insert it into this table.
case class Feature(name: String, value: Double)
val rdd: RDD[Feature]
val df = rdd.toDF()
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://...", "features", new Properties)
I get the error, Column count doesn’t match value count at row 1. If I delete the id column it works. How could I insert this data into the table without changing the schema?
You have to include an id field in the DataFrame so the column counts match; MySQL will then generate the auto-incremented ID on its side. That is:
case class Feature(id: Int, name: String, value: Double)
Then set id to 0 when you create a Feature: MySQL replaces 0 (or NULL) in an AUTO_INCREMENT column with the next generated value, unless the NO_AUTO_VALUE_ON_ZERO SQL mode is enabled; any other explicit value would be stored as-is.
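A small demonstration of that MySQL behavior against the features table from the question (assumes the default SQL mode, i.e. NO_AUTO_VALUE_ON_ZERO is not set):

INSERT INTO features (id, name, value) VALUES (0, 'width', 1.5);
INSERT INTO features (id, name, value) VALUES (0, 'height', 2.0);
SELECT * FROM features;  -- ids come back as 1 and 2, not 0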