I am trying to create a table from nested JSON.
The second layer of the JSON is very complex, and I don't want to keep its schema in the table definition as a struct column.
I am looking for a solution that allows me to keep it as a string.
For example:
{
"request_id": "3dbd4ee3-96fc-4342-bd62",
"payload": { < COMPLEX NESTED JSON > },
"timestamp": 1569161622
}
I was trying to use the following create statement:
CREATE EXTERNAL TABLE data (
request_id string,
payload string,
`timestamp` int
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3a://bucket'
Is there any SerDe property/mapping I can use to define the nested object as String?
You can use the org.openx.data.jsonserde.JsonSerDe SerDe.
For more information on this SerDe, see https://github.com/rcongiu/Hive-JSON-Serde
Hope this helps
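If switching SerDes is not an option, one possible workaround is to pre-process the files so that payload is already a JSON-encoded string before Hive reads it. This is a different technique from the SerDe suggestion above, sketched in Python; the helper name stringify_field and the sample payload content are made up for illustration.

```python
import json

def stringify_field(line, field="payload"):
    """Return the record with `field` re-encoded as a JSON string.

    Hypothetical helper: the nested object becomes plain text, so the
    Hive column can be declared as string.
    """
    record = json.loads(line)
    if field in record and not isinstance(record[field], str):
        # Compact separators keep the embedded JSON on a single line.
        record[field] = json.dumps(record[field], separators=(",", ":"))
    return json.dumps(record)

# Sample record; the payload content is invented for this sketch.
raw = '{"request_id": "3dbd4ee3-96fc-4342-bd62", "payload": {"a": {"b": [1, 2]}}, "timestamp": 1569161622}'
converted = stringify_field(raw)
print(converted)
```

The nested content can then still be queried from the string column with get_json_object when needed.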
I'm trying to create a hive external table for a json file in .txt format. I have tried several approaches but I think I'm going wrong in how the hive external table should be defined:
My Sample JSON is:
[[
{
"user": "ron",
"id": "17110",
"addr": "Some address"
},
{
"user": "harry",
"id": "42230",
"addr": "some other address"
}]]
As you can see, it's an array inside an array. It seems that this is valid JSON, returned by an API, although I have read posts saying that JSON should start with a '{'.
Anyway, I am trying to create an external table like this:
CREATE EXTERNAL TABLE db1.user(
array<array<
user:string,
id:string,
desc:string
>>)
PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
This does not work. Nor does something like this:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
After trying to modify the JSON text file (replacing [ with {, etc.) and adding a partition, I still wasn't able to query it using select *. I'm missing a key piece in the table structure.
Can you please help me so that the table can read my JSON correctly?
If required, I can modify the input JSON, if the double [[ is a problem.
1st: Each table row should be represented in the file as a single line; multi-line JSON is not supported.
2nd: You can have array<some complex type> as a single column, but this is not convenient because you will need to explode the array to access nested elements. The only reason you might want such a structure is when there really are multiple rows, each containing array<array<>>.
3rd: Everything in [] is an array. Everything in {} is a struct or a map; in your case it is a struct, and you have missed this rule. The fields user, id and desc are inside a struct, and the struct is nested inside an array. An array can have only a type in its definition: if it contains nested structs, it is array<struct<...>>; if it is an array of a simple type, it is, for example, array<string>.
4th: Your JSON is not valid because it contains an extra comma after the address value. Fix it.
If you prefer to have a single column colname containing array<array<struct<...>>>, then create the table like this:
CREATE EXTERNAL TABLE db1.user(
colname array<array<
struct<user:string,
id:string,
desc:string>
>>)...
And the JSON file should look like this (a single line for each row):
[[{"user": "ron","id": "17110","addr": "Some address"}, {"user": "harry","id": "42230","addr": "some other address"}]]
If the file contains a single big array nested in another array, it is better to remove [[ and ]], remove the commas between structs, and remove the extra newlines inside structs. If a single row is a struct {}, you can define your table without the outer struct<>; only nested structs should be defined as struct<>:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)...
Note that in this case you do not need : between the column name and the type. Use : only inside nested structs.
And the JSON should look like this (the whole JSON object, as defined in the DDL, on a single line; no comma between structs; each struct on a separate line):
{"user": "ron","id": "17110","addr": "Some address"}
{"user": "harry","id": "42230","addr": "some other address"}
Hope you got how it works. Read more in the JSONSerDe manual.
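The clean-up described above (drop the outer [[ ]], put each struct on its own line) can be done with a small script before loading the file. A minimal sketch in Python, assuming the whole file fits in memory; the function name to_json_lines is hypothetical.

```python
import json

def to_json_lines(text):
    """Flatten arbitrarily nested JSON arrays into one compact object per line."""
    def walk(node):
        # Descend through lists; anything else is treated as one row.
        if isinstance(node, list):
            for item in node:
                yield from walk(item)
        else:
            yield node
    return "\n".join(json.dumps(obj) for obj in walk(json.loads(text)))

# Multi-line, doubly nested input like the sample in the question.
sample = """[[
{"user": "ron", "id": "17110", "addr": "Some address"},
{"user": "harry", "id": "42230", "addr": "some other address"}]]"""
print(to_json_lines(sample))
```

Each output line is a complete JSON object, which is the layout the flat table definition expects.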
I store into my MongoDB collection a huge list of JSON strings. For simplicity, I have extracted a sample document into the text file businessResource.json:
{
"data" : {
"someBusinessData" : {
"capacity" : {
"fuelCapacity" : NumberLong(282)
},
"someField" : NumberLong(16),
"anotherField" : {
"isImportant" : true,
"lastDateAndTime" : "2008-01-01T11:11",
"specialFlag" : "YMA"
},
...
}
My problem: how can I convert the "someBusinessData" into a JSON object using Spark/Scala?
If I do that (for example using json4s or lift-json), I hope I can perform basic operations on them, for example checking them for equality.
Keep in mind that this is a rather large JSON object. Creating a case class is not worth it in my case, since the only operations I will perform are some filtering on two fields and comparing documents for equality, after which I will export them again.
This is how I fetch the data:
val df: DataFrame = (someSparkSession).sqlContext.read.json("src/test/resources/businessResource.json")
val myData: DataFrame = df.select("data.someBusinessData")
myData.printSchema
The schema shows:
root
|-- someBusinessData: struct (nullable = true)
| |-- capacity: struct (nullable = true)
Since "someBusinessData" is a structure, I cannot get it as String. When I try to print using
myData.first.getStruct(0), I get a string that contains the values but not the keys: [[[282],16,[true,2008-01-01T11:11,YMA]
Thanks for your help!
Instead of using .json, use .textFile to read your JSON file.
Then convert the RDD to a DataFrame (it will have only one string column).
Example:
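//toDF on an RDD needs the session implicits (already imported in spark-shell)
import spark.implicits._
//read json file as textfile and create df
val df=spark.sparkContext.textFile("<json_file_path>").toDF("str")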
//use get_json_object function to traverse json string
df.selectExpr("""get_json_object(str,"$.data.someBusinessData")""").show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|get_json_object(str,$.data.someBusinessData) |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"capacity":{"fuelCapacity":"(282)"},"someField":"(16)","anotherField":{"isImportant":true,"lastDateAndTime":"2008-01-01T11:11","specialFlag":"YMA"}}|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
In fact, my post contained two questions:
How can I convert the "someBusinessData" into a JSON object using Spark/Scala?
How can I get the JSON object as a String?
1. Conversion into a JSON object
What I was doing already created a DataFrame that can be navigated like a JSON object:
//read json file as Json and select the needed data
val df: DataFrame = sparkSession.sqlContext.read.json(filePath).select("data.someBusinessData")
If you use .textFile, you correctly get the String, but to parse the JSON you then need to resort to the answer from Shu.
2. How can I get the JSON object as a String?
Trivially:
df.toJSON.first
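For the equality check mentioned in the question, the same idea can be illustrated outside Spark: parse each document into a generic structure and compare the structures rather than the raw strings. The sketch below is plain Python, not Spark/Scala; parse_shell_json is a hypothetical helper, the two sample documents are shortened, and the NumberLong clean-up is a naive regex that would also rewrite a NumberLong(...) occurring inside a string value.

```python
import json
import re

def parse_shell_json(text):
    """Parse JSON that may contain mongo-shell NumberLong(n) wrappers."""
    cleaned = re.sub(r"NumberLong\((-?\d+)\)", r"\1", text)
    return json.loads(cleaned)

# Same content, different key order and number notation.
doc_a = '{"data": {"someBusinessData": {"capacity": {"fuelCapacity": NumberLong(282)}, "someField": NumberLong(16)}}}'
doc_b = '{"data": {"someBusinessData": {"someField": 16, "capacity": {"fuelCapacity": 282}}}}'

a = parse_shell_json(doc_a)["data"]["someBusinessData"]
b = parse_shell_json(doc_b)["data"]["someBusinessData"]

# Dict comparison ignores key order, unlike comparing the raw strings.
print(a == b)
# sort_keys gives a canonical string form for export or hashing.
print(json.dumps(a, sort_keys=True))
```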
I'm trying to generate a JSON string by combining various columns and save the JSON into a Postgres table column that has the JSON datatype. The documentation is clear about how to read from a JSON string:
define stream InputStream(json string);
from InputStream
select json:getString(json,"$.name") as name
insert into OutputStream;
But can we build the JSON in-flight and insert it into the table? Something like...
select '{"myname":json:getString(json,"$.name")}' as nameJSON
insert into postgresDB
Where nameJSON will be a JSON datatype in Postgres.
Any help would be greatly appreciated.
You can use json:setElement to create a JSON object from the attributes:
from OutputStream
select json:setElement("{}", "$", json:getString(json,"$.name"), "myname") as value
insert into TempStream;
I have this (shortened) Avro schema:
{
"type": "record",
"name": "license_upsert",
"namespace": "data.model",
"fields": [
{ "name": "upsert", "type":
{
"name": "EventType",
"type": "enum",
"symbols": ["INSERT", "UPDATE"]
}
}
]
}
This just defines an enum.
I can easily create an Avro file from some JSON data:
{
"upsert": "INSERT"
}
Using avro-tools, it all works fine, to and from Avro.
Now, these Avro files are loaded into an external table in Hive, and boom, Hive tells me that:
java.io.IOException: org.apache.avro.AvroTypeException: Found string, expecting data.model.EventType
According to the docs, Hive does not actually support enums, but if I DESCRIBE the table, the field is seen as a string:
col_name | data_type | comment
-------------------------------
upsert | string | ""
Is there a way for me to tell Hive that it should use a string? Even if I run a query that does not select the upsert field, I get the same error.
Note1:
I create the table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS events.event
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='file:///path/to/event.avsc'
)
STORED AS AVRO
LOCATION '/events/event'
;
Note2:
If I generate data with avro-tools (the random command), the data is loaded perfectly in Hive.
The data I am actually using is created by Confluent.
The reason is, as stated in the last line of the question:
The data I am actually using is created by Confluent.
It turns out that on output with the HDFS sink, enums are converted to strings. As I created the external tables in Hive based on my original schema, there was a discrepancy. Now if I just extract the schema from the file created by the HDFS sink and use that one in the table definition, everything works as expected.
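To see what the discrepancy looks like, the original schema can be rewritten so that every inline enum becomes a plain string. This is only an approximation of what the sink writes, not the method used above: extracting the schema from the written file remains the reliable route. The function name enums_to_strings is hypothetical, and enums referenced only by name elsewhere in a schema are not handled.

```python
import json

def enums_to_strings(schema):
    """Return a copy of an Avro schema with inline enum types replaced by "string"."""
    if isinstance(schema, dict):
        if schema.get("type") == "enum":
            return "string"
        return {key: enums_to_strings(value) for key, value in schema.items()}
    if isinstance(schema, list):
        return [enums_to_strings(item) for item in schema]
    return schema

# The shortened schema from the question.
original = {
    "type": "record",
    "name": "license_upsert",
    "namespace": "data.model",
    "fields": [
        {"name": "upsert",
         "type": {"name": "EventType", "type": "enum",
                  "symbols": ["INSERT", "UPDATE"]}}
    ],
}
converted = enums_to_strings(original)
print(json.dumps(converted, indent=2))
```

Comparing the converted schema with the one embedded in the sink's output file shows whether the enum is the only difference.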
I have one big JSON object.
How can I achieve the SQL below in PostgreSQL without using a table?
SELECT value->'col1' AS mycolumn
FROM json_object_keys('{"jcol1": "A", "jcol2": "B"}') as value
The JSON is {"activities-heart":[{"dateTime":"2016-10-17","restingHeartRate":65}]}
Expected output: heartrate: 65