JSON to Hive ingestion

add jar /path/to/hive-serdes-1.0-SNAPSHOT.jar;
CREATE EXTERNAL TABLE student
( id int, student_id INT, type STRING, score DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{ "id":"_id",
"student_id":"student_id", "type":"type","score":"score" }' )
TBLPROPERTIES('mongo.uri'='mongodb://****---****.nam.nsroot.net:*****/admin.student');
I am able to run the code and ingest data successfully, but the "id" field gets populated as NULL.
Should I change the data type? I tried STRING as well and got the same result.

According to the mongo-hadoop Hive SerDe, ObjectId corresponds to a special instance of STRUCT.
A Hive field corresponding to an ObjectId must be a STRUCT with the fields oid, a STRING, and bsontype, an INT, and nothing else. The oid is the string form of the ObjectId, while the bsontype should always be 8. Per your example, it should be:
CREATE EXTERNAL TABLE student
(id STRUCT<oid:STRING, bsontype:INT>, student_id INT, type STRING, score DOUBLE)
Where the output would be something similar to:
{"oid":"56d6e0f6ff1f17f74ebbc16c","bsontype":8}
{"oid":"56d6e0f8ff1f17f74ebbc16d","bsontype":8}
...
The above was tested with: MongoDB v3.2.x, mongo-java-driver-3.2.2.jar, mongo-hadoop-core-1.5.0-rc0.jar, mongo-hadoop-hive-1.5.0-rc0.jar.
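For reference, a minimal sketch of the full DDL with the struct-typed id, reusing the storage handler and column mapping from the question (the mongo.uri below is only a placeholder):
CREATE EXTERNAL TABLE student
( id STRUCT<oid:STRING, bsontype:INT>, -- ObjectId maps onto this two-field struct
  student_id INT, type STRING, score DOUBLE
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{ "id":"_id", "student_id":"student_id", "type":"type", "score":"score" }' )
TBLPROPERTIES('mongo.uri'='mongodb://<host>:<port>/admin.student');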

Related

Athena DDL statement for different data structures

I have data in XML form which I have converted to JSON format through a Glue crawler. The problem is in writing the DDL statement for a table in Athena: as you can see below, there is a Contact attribute in the JSON data. In some places it is a struct (single instance) and in others it is an array (multiple instances). I am sharing the DDL statements below for each type.
JSON Data Type 1
"ContactList": {
"Contact": {
}
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList struct<
Contact: struct<
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
JSON Data Type 2
"ContactList": {
"Contact": [
{},
{}
]
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList struct<
Contact: array <
struct<
>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
I am able to write the DDL statement for one case at a time only, and it works perfectly for each individual type. My question is: how can we write a DDL statement so that it caters to both types, whether Contact is a struct or an array? Thanks in advance.
The way you solve this in Athena is that you use the string type for the Contact field of the ContactList column, and then JSON functions in your queries.
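Under that approach, a minimal sketch of the DDL (same SerDe and table properties as in the question; only the Contact field changes to string):
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  ContactList struct<
    Contact: string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')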
When you query you can for example do (assuming contacts have a "name" field):
SELECT
  COALESCE(
    json_extract_scalar(ContactList.Contact, '$[0].name'),
    json_extract_scalar(ContactList.Contact, '$.name')
  ) AS name
FROM table_name
This uses json_extract_scalar which parses a string as JSON and then extracts a value using a JSONPath expression. COALESCE picks the first non-null value, so if the first JSONPath expression does not yield any value (because the property is not an array), the second is attempted.

Yelp data set: parsing JSON in Hive

create external table review
(
business_id string,
user_id string,
stars Double,
text string,
date date,
votes struct <
vote_type :string ,
count: int >)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
;
The table is created correctly, but I am getting an error when trying to query the stars and date fields in Hive, i.e. select stars from review gives an error.
The data set used is from the link below and is in JSON format:
https://www.yelp.com/dataset_challenge
You should give a pointer such as LOCATION '/user/ruchit31/god/' so that your table will point to that location. Modify your create table query:
create external table review
( business_id string,
user_id string,
stars Double,
text string,
date date,
votes struct<vote_type:string, count:int>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/path/'
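Once the table points at the data directory, a quick sanity check could look like this (a sketch; date is backquoted because it is a reserved word in recent Hive versions):
SELECT business_id, stars, `date`
FROM review
LIMIT 5;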

Insert DataFrame into SQL table with AUTO_INCREMENT column

I have a MySQL table which includes a column that is AUTO_INCREMENT:
CREATE TABLE features (
  id INT NOT NULL AUTO_INCREMENT,
  name CHAR(30),
  value DOUBLE PRECISION,
  PRIMARY KEY (id) -- MySQL requires the AUTO_INCREMENT column to be part of a key
);
I created a DataFrame and wanted to insert it into this table.
case class Feature(name: String, value: Double)
val rdd: RDD[Feature]
val df = rdd.toDF()
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://...", "features", new Properties)
I get the error "Column count doesn’t match value count at row 1". If I delete the id column it works. How can I insert this data into the table without changing the schema?
You have to include an id field in the DataFrame, but if you set it to 0 its value will be ignored and replaced with the auto-incremented ID. That is:
case class Feature(id: Int, name: String, value: Double)
Then just set id to 0 when you create a Feature; the explicit 0 is what tells MySQL to substitute the next auto-incremented value.
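To verify that behaviour outside of Spark, a quick check in plain MySQL (using the features table from the question and assuming the default SQL mode, i.e. NO_AUTO_VALUE_ON_ZERO is not set):
-- an explicit 0 makes MySQL generate the next AUTO_INCREMENT value
INSERT INTO features (id, name, value) VALUES (0, 'height', 1.75);
INSERT INTO features (id, name, value) VALUES (0, 'weight', 70.5);
SELECT * FROM features; -- rows come back with ids 1, 2, ...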

How can I parse a JSON column of a Hive table using a JSON SerDe?

I am trying to load de-serialized JSON events into different tables, based on the name of the event.
Right now I have all the events in the same table; the table has only two columns, EventName and Payload (the payload stores the JSON representation of the event):
CREATE TABLE event( EventName STRING, Payload STRING)
So basically what I want is to load the data into the following table:
CREATE TABLE TempEvent ( Column1 STRING, Column2 STRING, Column3 STRING )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
And load the events with something like:
INSERT INTO TempEvent select Payload from event where EventName='TempEvent';
But Hive is throwing an exception saying that the destination table has 3 columns and the select statement just 1.
Is there another way to accomplish this, or am I doing something wrong?
The JSON SerDe requires a table with one JSON document per line in order to use it. So it won't work with your input table, because the line
TempEvent, {"Column1":"value1","Column2":"value2","Column3":"value3"}
is not valid JSON. So first you need to move the payloads into a new intermediate table which contains just the valid JSON, and then populate the JSON SerDe table from there using load data:
-- intermediate table that holds only the raw JSON payloads
create table event_json (Payload string)
stored as textfile;
-- copy the payloads of the events we care about
insert into table event_json
select Payload from event
where EventName='TempEvent';
-- JSON SerDe table with the real columns
create table TempEvent (Column1 string, Column2 string, Column3 string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
-- move the intermediate table's data files under TempEvent
load data inpath '/user/hive/warehouse/event_json' overwrite into table TempEvent;
Then you can extract the columns like this:
select Column1, Column2, Column3
from TempEvent;
Of course, all of this processing would not be necessary if your source table were valid JSON originally; you could just create the TempEvent table as an external table and pull the data directly from it.
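As an alternative that skips the intermediate table entirely, a sketch using Hive's built-in get_json_object to pull the fields straight out of the Payload string (column names taken from the TempEvent definition above):
SELECT
  get_json_object(Payload, '$.Column1') AS Column1,
  get_json_object(Payload, '$.Column2') AS Column2,
  get_json_object(Payload, '$.Column3') AS Column3
FROM event
WHERE EventName = 'TempEvent';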

How to parse JSON using json_populate_recordset in Postgres

I have JSON stored as text in one of my database rows. The JSON data is as follows:
[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},{"id":67273,"name":"16167.txt"},{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]
To parse this I want to use the PostgreSQL function
json_populate_recordset()
When I run a command like
select json_populate_recordset(null::json,'[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},{"id":67273,"name":"16167.txt"},{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]') from anoop;
it gives me the following error:
first argument of json_populate_recordset must be a row type
Note: in the FROM clause, "anoop" is the table name.
Can anyone suggest how to use the json_populate_recordset function to extract data from this JSON string?
I got the function's reference from
http://www.postgresql.org/docs/9.3/static/functions-json.html
The first argument passed to the PostgreSQL function json_populate_recordset should be a row type. If you want to use the JSON array to populate the existing table anoop, you can simply pass the table anoop as the row type, like this:
insert into anoop
select * from json_populate_recordset(null::anoop,
'[{"id":67272,"name":"EE_Quick_Changes_J_UTP.xlsx"},
{"id":67273,"name":"16167.txt"},
{"id":67274,"name":"EE_12_09_2013_Bcum_Searchall.png"}]');
Here, null is the default value to insert into table columns that are not set in the JSON passed in.
If you don't have an existing table, you need to create a row type to hold your JSON data (i.e. column names and their types) and pass it as the first parameter, like this anoop_type:
create TYPE anoop_type AS (id int, name varchar(100));
select * from json_populate_recordset(null :: anoop_type,
'[...]') --same as above
Alternatively, there is no need to create a new type for that; you can supply the column definitions inline:
select *
from json_populate_recordset(null::record,
  '[{"id_item":1,"id_menu":"34"},{"id_item":2,"id_menu":"35"}]')
as (id_item int, id_menu int);
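For completeness, the same inline column definition can also feed an INSERT; the target table items here is hypothetical:
INSERT INTO items (id_item, id_menu)
SELECT id_item, id_menu
FROM json_populate_recordset(null::record,
  '[{"id_item":1,"id_menu":"34"},{"id_item":2,"id_menu":"35"}]')
AS (id_item int, id_menu int);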