circe doesn't see field when it contains an array

I've got two "tests": the one where I'm trying to decode a single user works, but the one where I'm trying to decode a list of users doesn't:
import User._
import io.circe._
import io.circe.syntax._
import io.circe.parser.decode

class UserSuite extends munit.FunSuite:

  test("List of users can be decoded") {
    val json = """|{
                  |  "data" : [
                  |    {
                  |      "id" : "someId",
                  |      "name" : "someName",
                  |      "username" : "someusername"
                  |    },
                  |    {
                  |      "id" : "someId",
                  |      "name" : "someName",
                  |      "username" : "someusername"
                  |    }
                  |  ]
                  |}""".stripMargin
    println(decode[List[User]](json))
  }

  test("user can be decoded") {
    val json = """|{
                  |  "data" : {
                  |    "id" : "someId",
                  |    "name" : "someName",
                  |    "username" : "someusername"
                  |  }
                  |}""".stripMargin
    println(decode[User](json))
  }
The failing one produces
Left(DecodingFailure(List, List(DownField(data))))
despite the fact that both the JSON's relevant structure and the decoders (below) are the same.
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder

final case class User(
  id: String,
  name: String,
  username: String
)

object User:
  given Decoder[List[User]] =
    deriveDecoder[List[User]].prepare(_.downField("data"))

  given Decoder[User] =
    deriveDecoder[User].prepare(_.downField("data"))
As far as I understand this should work, even according to one of Travis' older replies, but it doesn't.
Is this a bug? Am I doing something wrong?
For reference, this is Scala 3.2.0 and circe 0.14.1.

The thing is that you need two different decoders for User: one expecting the data field, to decode the second JSON, and one not expecting the data field, to use while deriving the decoder for a list. Otherwise the first JSON would have to be
"""|{
   |  "data" : [
   |    {
   |      "data" : {
   |        "id" : "someId",
   |        "name" : "someName",
   |        "username" : "someusername"
   |      }
   |    },
   |    {
   |      "data" : {
   |        "id" : "someId",
   |        "name" : "someName",
   |        "username" : "someusername"
   |      }
   |    }
   |  ]
   |}""".stripMargin
It's better to be explicit now:
import io.circe.Decoder
import io.circe.generic.semiauto

final case class User(
  id: String,
  name: String,
  username: String
)

object User {
  val userDec: Decoder[User] = semiauto.deriveDecoder[User]

  val preparedUserDec: Decoder[User] = userDec.prepare(_.downField("data"))

  val userListDec: Decoder[List[User]] = {
    implicit val dec: Decoder[User] = userDec
    Decoder[List[User]].prepare(_.downField("data"))
  }
}
import io.circe.parser.decode

val json =
  """|{
     |  "data" : [
     |    {
     |      "id" : "someId",
     |      "name" : "someName",
     |      "username" : "someusername"
     |    },
     |    {
     |      "id" : "someId",
     |      "name" : "someName",
     |      "username" : "someusername"
     |    }
     |  ]
     |}""".stripMargin

decode[List[User]](json)(User.userListDec)
// Right(List(User(someId,someName,someusername), User(someId,someName,someusername)))

val json1 =
  """|{
     |  "data" : {
     |    "id" : "someId",
     |    "name" : "someName",
     |    "username" : "someusername"
     |  }
     |}""".stripMargin

decode[User](json1)(User.preparedUserDec)
// Right(User(someId,someName,someusername))
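If you'd rather keep Scala 3 given instances than pass decoders explicitly, the same separation can be expressed by deriving the bare decoder once and building both public instances from it. A sketch along the same lines (Decoder.decodeList is circe's stock list decoder; the private val name is mine):

import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder

object User:
  // Decoder for a bare {id, name, username} object, kept out of given scope.
  private val bareUserDec: Decoder[User] = deriveDecoder[User]

  // For a single user wrapped in "data".
  given Decoder[User] = bareUserDec.prepare(_.downField("data"))

  // For a list wrapped in "data": built from the bare element decoder rather
  // than from the prepared given above, so elements don't need their own "data".
  given Decoder[List[User]] =
    Decoder.decodeList(using bareUserDec).prepare(_.downField("data"))

The explicit given for List[User] takes priority over the instance circe would otherwise build from the prepared Decoder[User].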

Related

How to insert a new key-value into the first row of a table containing a json (Snowflake)

I have a table "MY_TABLE" with one column "VALUE", and the first row of the column contains a JSON that looks like:
{
  "VALUE": {
    "c1": "name",
    "c10": "age",
    "c100": "gender",
    "c101": "address",
    "c102": "status"
  }
}
I would like to add a new key-value pair to this JSON in the first row, where the pair is "c125" : "job", so that the result looks like:
{
  "VALUE": {
    "c1": "name",
    "c10": "age",
    "c100": "gender",
    "c101": "address",
    "c102": "status",
    "c125": "job"
  }
}
I tried:
SELECT object_insert(OBJECT_CONSTRUCT(*),'c125', 'job') FROM MY_TABLE;
But it inserted the new key-value pair into the wrong spot, so the result looks like:
{
  "VALUE": {
    "c1": "name",
    "c10": "age",
    "c100": "gender",
    "c101": "address",
    "c102": "status"
  },
  "c125": "job"
}
Is there another way to do this? Thanks!
Another, similar approach, using OBJECT_INSERT.
For the original table (assuming the column data type is VARIANT; otherwise use the PARSE_JSON function):
select * from temp_1;
+------------------------+
| COL1                   |
|------------------------|
| {                      |
|   "VALUE": {           |
|     "c1": "name",      |
|     "c10": "age",      |
|     "c100": "gender",  |
|     "c101": "address", |
|     "c102": "status"   |
|   }                    |
| }                      |
+------------------------+
Query with the added key ("c31": 101) as output:
select
  object_insert(col1, 'VALUE', object_insert(col1:VALUE, 'c31', 101), TRUE) as output_col
from temp_1;
+------------------------+
| OUTPUT_COL             |
|------------------------|
| {                      |
|   "VALUE": {           |
|     "c1": "name",      |
|     "c10": "age",      |
|     "c100": "gender",  |
|     "c101": "address", |
|     "c102": "status",  |
|     "c31": 101         |
|   }                    |
| }                      |
+------------------------+
The same clause used in an UPDATE (it can be predicated on another column used as a key):
update temp_1 set col1 = object_insert(col1, 'VALUE', object_insert(col1:VALUE, 'c31', 101), TRUE);
After the update:
select * from temp_1;
+------------------------+
| COL1                   |
|------------------------|
| {                      |
|   "VALUE": {           |
|     "c1": "name",      |
|     "c10": "age",      |
|     "c100": "gender",  |
|     "c101": "address", |
|     "c102": "status",  |
|     "c31": 101         |
|   }                    |
| }                      |
+------------------------+
One approach could be to flatten the result first and construct it again:
CREATE TABLE MY_TABLE
AS
SELECT PARSE_JSON('{
  "VALUE": {
    "c1": "name",
    "c10": "age",
    "c100": "gender",
    "c101": "address",
    "c102": "status"
  }
}') AS VALUE;
SELECT * FROM MY_TABLE;
Query:
WITH cte(key, value) AS (
  SELECT 'c125', 'job'::VARIANT
  UNION ALL
  SELECT s.key, s.value
  FROM MY_TABLE,
       TABLE(FLATTEN(input => VALUE, path => 'VALUE')) s
)
SELECT OBJECT_CONSTRUCT('VALUE', OBJECT_AGG(key, value))
FROM cte;

Transforming Pandas Dataframe into JSON with Struct and Array structure for Upload to BigQuery

Suppose I have data from a DataFrame with columns id, title, category, subcategory, and sub-subcategory that looks like:
 _________________________________________________________________
| id | title        | cat          | subcat      | subsubcat    |
|____|______________|______________|_____________|______________|
| 1  | My Book      | cat1         | subcat1     | subsubcat1   |
| 1  | My Book      | cat2         | subcat2     | subsubcat2   |
| 2  | My Other Book| othercat1    | othersubcat1| othersubcat1 |
| 2  | My Other Book| othercat2    | othersubcat2| null         |
| 2  | My Other Book| othercat3    | null        | null         |
|____|______________|______________|_____________|______________|
I want to turn this into a (newline-delimited) JSON that has a structure like:
[
  {
    'id' : '1',
    'title' : 'My Book',
    'categoryHiearchies': [
      {'categories': ['category1', 'subcategory1', 'sub-subcategory1']},
      {'categories': ['category2', 'subcategory2', 'sub-subcategory2']}
    ]
  },
  {
    'id' : '2',
    'title' : 'My Other Book',
    'categoryHiearchies': [
      {'categories': ['othercategory1', 'othersubcategory1', 'othersub-subcategory1']},
      {'categories': ['othercategory2', 'othersubcategory2']},
      {'categories': ['othercategory3']}
    ]
  }
]
in order to properly upload it to BigQuery.
Any ideas how to apply this transformation?
Assuming the nulls are NaN values:
(df.set_index(['id', 'title'], append=True).stack()                   # one row per (row, id, title, category level)
   .groupby(level=[0, 1, 2]).agg(lambda x: {'categories': list(x)})   # each original row -> {'categories': [...]}
   .groupby(level=[1, 2]).agg(list)                                   # collect those dicts per (id, title)
   .reset_index(name='categoryHiearchies')
   .to_json(orient='records', indent=2)
)
which gives
[
  {
    "id":1,
    "title":"My Book",
    "categoryHiearchies":[
      {
        "categories":[
          "cat1",
          "subcat1",
          "subsubcat1"
        ]
      },
      {
        "categories":[
          "cat2",
          "subcat2",
          "subsubcat2"
        ]
      }
    ]
  },
  {
    "id":2,
    "title":"My Other Book",
    "categoryHiearchies":[
      {
        "categories":[
          "othercat1",
          "othersubcat1",
          "othersubcat1"
        ]
      },
      {
        "categories":[
          "othercat2",
          "othersubcat2"
        ]
      },
      {
        "categories":[
          "othercat3"
        ]
      }
    ]
  }
]
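Note that orient='records' with indent=2 produces a single JSON array, while BigQuery load jobs expect newline-delimited JSON. A sketch of the same chain (untested) that emits one record per line instead:

# Same pipeline, but emit newline-delimited JSON (one record per line),
# which is what BigQuery load jobs expect.
ndjson = (df.set_index(['id', 'title'], append=True).stack()
            .groupby(level=[0, 1, 2]).agg(lambda x: {'categories': list(x)})
            .groupby(level=[1, 2]).agg(list)
            .reset_index(name='categoryHiearchies')
            .to_json(orient='records', lines=True))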

How to make nested JSON response in Go?

I am new to Go and need some help.
In my PostgreSQL database I have 4 tables. They are called surveys, questions, options, and surveys_questions_options.
They look like this:
surveys table:
| survey_id (uuid4)                    | survey_name (varchar) |
|--------------------------------------|-----------------------|
| 0cf1cf18-d5fd-474e-a8be-754fbdc89720 | April                 |
| b9fg55d9-n5fy-s7fe-s5bh-856fbdc89720 | May                   |
questions table:
| question_id (int) | question_text (text)         |
|-------------------|------------------------------|
| 1                 | What is your favorite color? |
options table:
| option_id (int)   | option_text (text) |
|-------------------|--------------------|
| 1                 | red                |
| 2                 | blue               |
| 3                 | grey               |
| 4                 | green              |
| 5                 | brown              |
surveys_questions_options table combines data from all three previous tables:
| survey_id                            | question_id | option_id |
|--------------------------------------|-------------|-----------|
| 0cf1cf18-d5fd-474e-a8be-754fbdc89720 | 1           | 1         |
| 0cf1cf18-d5fd-474e-a8be-754fbdc89720 | 1           | 2         |
| 0cf1cf18-d5fd-474e-a8be-754fbdc89720 | 1           | 3         |
| b9fg55d9-n5fy-s7fe-s5bh-856fbdc89720 | 1           | 3         |
| b9fg55d9-n5fy-s7fe-s5bh-856fbdc89720 | 1           | 4         |
| b9fg55d9-n5fy-s7fe-s5bh-856fbdc89720 | 1           | 5         |
How can I make a nested JSON response in Go? I use the GORM library. I want a JSON response like this:
[
  {
    "survey_id": "0cf1cf18-d5fd-474e-a8be-754fbdc89720",
    "survey_name": "April",
    "questions": [
      {
        "question_id": 1,
        "question_text": "What is your favorite color?",
        "options": [
          {
            "option_id": 1,
            "option_text": "red"
          },
          {
            "option_id": 2,
            "option_text": "blue"
          },
          {
            "option_id": 3,
            "option_text": "grey"
          }
        ]
      }
    ]
  },
  {
    "survey_id": "b9fg55d9-n5fy-s7fe-s5bh-856fbdc89720",
    "survey_name": "May",
    "questions": [
      {
        "question_id": 1,
        "question_text": "What is your favorite color?",
        "options": [
          {
            "option_id": 3,
            "option_text": "grey"
          },
          {
            "option_id": 4,
            "option_text": "green"
          },
          {
            "option_id": 5,
            "option_text": "brown"
          }
        ]
      }
    ]
  }
]
My models look like this:
type Survey struct {
    SurveyID   string `gorm:"primary_key" json:"survey_id"`
    SurveyName string `gorm:"not null" json:"survey_name"`
    Questions  []Question
}

type Question struct {
    QuestionID   int    `gorm:"primary_key" json:"question_id"`
    QuestionText string `gorm:"not null;unique" json:"question_text"`
    Options      []Option
}

type Option struct {
    OptionID   int    `gorm:"primary_key" json:"option_id"`
    OptionText string `gorm:"not null;unique" json:"option_text"`
}
I'm not sure about the GORM part, but for JSON you need to add struct tags on the nested objects as well:
type Survey struct {
    ...
    Questions []Question `json:"questions"`
}

type Question struct {
    ...
    Options []Option `json:"options"`
}
We're missing some context from your code, so it's quite hard to point you in the right direction. Are you asking about querying GORM so you get []Survey, or are you asking about marshalling []Survey? Anyway, you should add the tag to Questions too, as slomek replied.
However, try this to fetch nested data in an m2m relation:
type Survey struct {
    gorm.Model
    SurveyID   string      `gorm:"primary_key" json:"survey_id"`
    SurveyName string      `gorm:"not null" json:"survey_name"`
    Questions  []*Question `gorm:"many2many:survey_questions;"`
}

surveys := []*model.Survey{}
db := dbSession.Where(&model.Survey{SurveyID: id}).Preload("Questions").Find(&surveys)
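To get the options nested under each question as well, GORM also supports nested preloading. A sketch, assuming two-way join tables (survey_questions and question_options, both hypothetical names) instead of the original three-way surveys_questions_options table:

type Question struct {
    QuestionID   int       `gorm:"primary_key" json:"question_id"`
    QuestionText string    `gorm:"not null;unique" json:"question_text"`
    Options      []*Option `gorm:"many2many:question_options;" json:"options"` // hypothetical join table
}

surveys := []*model.Survey{}
// Preload("Questions.Options") walks the associations two levels deep,
// so each survey comes back with its questions and their options populated.
db := dbSession.Where(&model.Survey{SurveyID: id}).Preload("Questions.Options").Find(&surveys)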

Nested JSON structure modelling to NoSQL attribute

I want to create a table in DynamoDB with the below structure.
{
  "id" : 123,
  "name" : "xyz",
  "info" : {
    "key1" : "val1",
    "key2" : {
      "key3" : [
        "val3"
      ]
    }
  }
}
The simplest solution is

| id  | name | info          |
|-----|------|---------------|
| 123 | xyz  | {json string} |

But what are some better solutions, other than making "key1"/"key2" top-level attributes? Please help with a more optimal solution!

Load multi-line JSON data into HIVE table

I have multi-line JSON data, and I have created a Hive table to load it into. I also have another JSON which is a single-line JSON record. When I load the single-line JSON record into its Hive table and try to query it, it works fine. But when I load the multi-line JSON into its Hive table, it gives the below exception:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: java.io.ByteArrayInputStream@8b89b3a; line: 1, column: 0]) at [Source: java.io.ByteArrayInputStream@8b89b3a; line: 1, column: 3]
Below is my JSON data:
{
  "uploadTimeStamp" : "1486631318873",
  "PDID" : "123",
  "data" : [ {
    "Data" : {
      "unit" : "rpm",
      "value" : "0"
    },
    "EventID" : "E1",
    "PDID" : "123",
    "Timestamp" : 1486631318873,
    "Timezone" : 330,
    "Version" : "1.0",
    "pii" : { }
  }, {
    "Data" : {
      "heading" : "N",
      "loc3" : "false",
      "loc" : "14.022425",
      "loc1" : "78.760587",
      "loc4" : "false",
      "speed" : "10"
    },
    "EventID" : "E2",
    "PDID" : "123",
    "Timestamp" : 1486631318873,
    "Timezone" : 330,
    "Version" : "1.1",
    "pii" : { }
  }, {
    "Data" : {
      "x" : "1.1",
      "y" : "1.2",
      "z" : "2.2"
    },
    "EventID" : "E3",
    "PDID" : "123",
    "Timestamp" : 1486631318873,
    "Timezone" : 330,
    "Version" : "1.0",
    "pii" : { }
  }, {
    "EventID" : "E4",
    "Data" : {
      "value" : "50",
      "unit" : "percentage"
    },
    "Version" : "1.0",
    "Timestamp" : 1486631318873,
    "PDID" : "123",
    "Timezone" : 330
  }, {
    "Data" : {
      "unit" : "kmph",
      "value" : "70"
    },
    "EventID" : "E5",
    "PDID" : "123",
    "Timestamp" : 1486631318873,
    "Timezone" : 330,
    "Version" : "1.0",
    "pii" : { }
  } ]
}
I am using /hive/lib/hive-hcatalog-core-0.13.0.jar
Below is my create table statement:
create table test7(
  uploadtime bigint,
  pdid string,
  data array<
    struct<Data:struct<
        unit:string,
        value:int>,
      eventid:string,
      pdid:bigint,
      time:bigint,
      timezone:int,
      version:int,
      pii:struct<pii:string>>,
    struct<Data:struct<
        heading:string,
        Location:string,
        latitude:bigint,
        longitude:bigint,
        Location2:string,
        speed:int>,
      eventid:string,
      pdid:bigint,
      time:bigint,
      timezone:int,
      version:int,
      pii:struct<pii:string>>,
    struct<Data:struct<
        unit:string,
        value:int>,
      eventid:string,
      pdid:bigint,
      time:bigint,
      timezone:int,
      version:int,
      pii:struct<pii:string>>,
    struct<Data:struct<
        x:int,
        y:int,
        z:int>,
      eventid:string,
      pdid:bigint,
      time:bigint,
      timezone:int,
      version:int,
      pii:struct<pii:string>>,
    struct<Data:struct<
        heading:string,
        loc3:string,
        latitude:bigint,
        longitude:bigint,
        loc4:string,
        speed:int>,
      eventid:string,
      pdid:bigint,
      time:bigint,
      timezone:int,
      version:int,
      pii:struct<pii:string>>
  >
)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
  '/xyz/abc/';
Edit:
Adding the single-line JSON and the new create table statement with its error:
{"uploadTimeStamp":"1487183800905","PDID":"123","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"event1","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}},{"Data":{"heading":"N","loc1":"false","latitude":"16.032425","longitude":"80.770587","loc2":"false","speed":"10"},"EventID":"event2","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.1","pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"event3":"AccelerometerInfo","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}},{"EventID":"event4","Data":{"value":"50","unit":"percentage"},"Version":"1.0","Timestamp":1487183800905,"PDID":"123","Timezone":330},{"Data":{"unit":"kmph","value":"70"},"EventID":"event5","PDID":"123","Timestamp":1487183800905,"Timezone":330,"Version":"1.0","pii":{}}]}
create table test1(
  uploadTimeStamp string,
  PDID string,
  data array<struct<
      Data:struct<unit:string,value:int>,
      EventID:string,
      PDID:string,
      TimeS:bigint,
      Timezone:int,
      Version:float,
      pii:struct<>>,
    struct<
      Data:struct<heading:string,loc1:string,latitude:double,longitude:double,loc2:string,speed:int>,
      EventID:string,
      PDID:string,
      TimeS:bigint,
      Timezone:int,
      Version:float,
      pii:struct<>>,
    struct<
      Data:struct<x:float,y:float,z:float>,
      EventID:string,
      PDID:string,
      TimeS:bigint,
      Timezone:int,
      Version:float,
      pii:struct<>>,
    struct<
      EventID:string,
      Data:struct<value:int,unit:percentage>,
      Version:float,
      TimeS:bigint,
      PDID:string,
      Timezone:int>,
    struct<
      Data:struct<unit:string,value:int>,
      EventID:string,
      PDID:string,
      TimeS:bigint,
      Timezone:int,
      Version:float,
      pii:struct<>>
  >
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION
  '/ABC/XYZ/';
MismatchedTokenException(320!=313)
...
...
...
FAILED: ParseException line 11:10 mismatched input '<>' expecting < near 'struct' in struct type
Sample data
{"uploadTimeStamp":"1486631318873","PDID":"123","data":[{"Data":{"unit":"rpm","value":"0"},"EventID":"E1","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}},{"Data":{"heading":"N","loc3":"false","loc":"14.022425","loc1":"78.760587","loc4":"false","speed":"10"},"EventID":"E2","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.1","pii":{}},{"Data":{"x":"1.1","y":"1.2","z":"2.2"},"EventID":"E3","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}},{"EventID":"E4","Data":{"value":"50","unit":"percentage"},"Version":"1.0","Timestamp":1486631318873,"PDID":"123","Timezone":330},{"Data":{"unit":"kmph","value":"70"},"EventID":"E5","PDID":"123","Timestamp":1486631318873,"Timezone":330,"Version":"1.0","pii":{}}]}
add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
create external table myjson
(
  uploadTimeStamp string
  ,PDID string
  ,data array
  <
    struct
    <
      Data:struct
      <
        unit:string
        ,value:string
        ,heading:string
        ,loc3:string
        ,loc:string
        ,loc1:string
        ,loc4:string
        ,speed:string
        ,x:string
        ,y:string
        ,z:string
      >
      ,EventID:string
      ,PDID:string
      ,`Timestamp`:bigint
      ,Timezone:smallint
      ,Version:string
      ,pii:struct<dummy:string>
    >
  >
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location '/tmp/myjson'
;
select * from myjson
;
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| myjson.uploadtimestamp | myjson.pdid | myjson.data |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1486631318873 | 123 | [{"data":{"unit":"rpm","value":"0","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E1","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":"N","loc3":"false","loc":"14.022425","loc1":"78.760587","loc4":"false","speed":"10","x":null,"y":null,"z":null},"eventid":"E2","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.1","pii":{"dummy":null}},{"data":{"unit":null,"value":null,"heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":"1.1","y":"1.2","z":"2.2"},"eventid":"E3","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}},{"data":{"unit":"percentage","value":"50","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E4","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":null},{"data":{"unit":"kmph","value":"70","heading":null,"loc3":null,"loc":null,"loc1":null,"loc4":null,"speed":null,"x":null,"y":null,"z":null},"eventid":"E5","pdid":"123","timestamp":1486631318873,"timezone":330,"version":"1.0","pii":{"dummy":null}}] |
+------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
select j.uploadTimeStamp
,j.PDID
,d.val.EventID
,d.val.PDID
,d.val.`Timestamp`
,d.val.Timezone
,d.val.Version
,d.val.Data.unit
,d.val.Data.value
,d.val.Data.heading
,d.val.Data.loc3
,d.val.Data.loc
,d.val.Data.loc1
,d.val.Data.loc4
,d.val.Data.speed
,d.val.Data.x
,d.val.Data.y
,d.val.Data.z
from myjson j
lateral view explode (data) d as val
;
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
| j.uploadtimestamp | j.pdid | eventid | pdid | timestamp     | timezone | version | unit       | value | heading | loc3  | loc       | loc1      | loc4  | speed | x    | y    | z    |
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
| 1486631318873     | 123    | E1      | 123  | 1486631318873 | 330      | 1.0     | rpm        | 0     | NULL    | NULL  | NULL      | NULL      | NULL  | NULL  | NULL | NULL | NULL |
| 1486631318873     | 123    | E2      | 123  | 1486631318873 | 330      | 1.1     | NULL       | NULL  | N       | false | 14.022425 | 78.760587 | false | 10    | NULL | NULL | NULL |
| 1486631318873     | 123    | E3      | 123  | 1486631318873 | 330      | 1.0     | NULL       | NULL  | NULL    | NULL  | NULL      | NULL      | NULL  | NULL  | 1.1  | 1.2  | 2.2  |
| 1486631318873     | 123    | E4      | 123  | 1486631318873 | 330      | 1.0     | percentage | 50    | NULL    | NULL  | NULL      | NULL      | NULL  | NULL  | NULL | NULL | NULL |
| 1486631318873     | 123    | E5      | 123  | 1486631318873 | 330      | 1.0     | kmph       | 70    | NULL    | NULL  | NULL      | NULL      | NULL  | NULL  | NULL | NULL | NULL |
+-------------------+--------+---------+------+---------------+----------+---------+------------+-------+---------+-------+-----------+-----------+-------+-------+------+------+------+
I was having the same issue, then decided to create a custom input format which can extract the multiline (pretty-printed) JSON records.
This JsonRecordReader can read a multiline JSON record in Hive. It extracts the record by balancing the curly braces { and }, so the content from the first '{' to the balancing last '}' is considered one complete record. Below is the code snippet:
public static class JsonRecordReader implements RecordReader<LongWritable, Text> {

    public static final String START_TAG_KEY = "jsoninput.start";
    public static final String END_TAG_KEY = "jsoninput.end";

    private byte[] startTag = "{".getBytes();
    private byte[] endTag = "}".getBytes();
    private long start;
    private long end;
    private FSDataInputStream fsin;
    private final DataOutputBuffer buffer = new DataOutputBuffer();

    public JsonRecordReader(FileSplit split, JobConf jobConf) throws IOException {
        // uncomment the below lines if you need to get the configuration
        // from JobConf:
        // startTag = jobConf.get(START_TAG_KEY).getBytes("utf-8");
        // endTag = jobConf.get(END_TAG_KEY).getBytes("utf-8");

        // open the file and seek to the start of the split:
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(jobConf);
        fsin = fs.open(split.getPath());
        fsin.seek(start);
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
        if (fsin.getPos() < end) {
            AtomicInteger count = new AtomicInteger(0);
            // first pass (withinBlock = false): scan forward to the next opening '{';
            // second pass (withinBlock = true): copy bytes until the braces balance again
            if (readUntilMatch(false, count)) {
                try {
                    buffer.write(startTag);
                    if (readUntilMatch(true, count)) {
                        key.set(fsin.getPos());
                        // create json record from buffer:
                        String jsonRecord = new String(buffer.getData(), 0, buffer.getLength());
                        value.set(jsonRecord);
                        return true;
                    }
                } finally {
                    buffer.reset();
                }
            }
        }
        return false;
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return fsin.getPos();
    }

    @Override
    public void close() throws IOException {
        fsin.close();
    }

    @Override
    public float getProgress() throws IOException {
        return ((fsin.getPos() - start) / (float) (end - start));
    }

    private boolean readUntilMatch(boolean withinBlock, AtomicInteger count) throws IOException {
        while (true) {
            int b = fsin.read();
            // end of file:
            if (b == -1)
                return false;
            // save to buffer:
            if (withinBlock)
                buffer.write(b);
            // check if we're matching start/end tag:
            if (b == startTag[0]) {
                count.incrementAndGet();
                if (!withinBlock) {
                    return true;
                }
            } else if (b == endTag[0]) {
                count.getAndDecrement();
                if (count.get() == 0) {
                    return true;
                }
            }
            // see if we've passed the stop point:
            if (!withinBlock && count.get() == 0 && fsin.getPos() >= end)
                return false;
        }
    }
}
This input format can be used along with the JSON SerDe supplied by Hive to read the multiline JSON file.
CREATE TABLE books (id string, bookname string, properties struct<subscription:string, unit:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT 'JsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The working code with samples is here: https://github.com/unayakdev/hive-json