Avro with Enum in Hive - json

I have this (shortened) avro schema:
{
  "type": "record",
  "name": "license_upsert",
  "namespace": "data.model",
  "fields": [
    {
      "name": "upsert",
      "type": {
        "name": "EventType",
        "type": "enum",
        "symbols": ["INSERT", "UPDATE"]
      }
    }
  ]
}
This just defines an ENUM.
I can easily create an Avro file from some JSON data:
{
"upsert": "INSERT"
}
Using avro-tools, it all works fine, converting to and from Avro.
Now, these Avro files are loaded into an external table in Hive, and boom, Hive tells me:
java.io.IOException: org.apache.avro.AvroTypeException: Found string, expecting data.model.EventType
According to the docs, Hive does not actually support enums, but if I DESCRIBE the table, the field is shown as a string:
col_name | data_type | comment
-------------------------------
upsert | string | ""
Is there a way for me to tell Hive that it should use a string? Even if I run a query that does not select the upsert field, I get the same error.
Note 1:
I create the table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS events.event
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='file:///path/to/event.avsc'
)
STORED AS AVRO
LOCATION '/events/event'
;
Note 2:
If I generate data with avro-tools (the random command), the data loads perfectly into Hive.
The data I am actually using is created by Confluent.

The reason is, as stated in the last line of the question:
The data I am actually using is created by Confluent.
It turns out that on output with the HDFS sink, ENUMs are converted to strings. Since I created the external tables in Hive based on my original schema, there was a discrepancy. If I instead extract the schema from a file created by the HDFS sink and use that one in the table definition, everything works as expected.
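One way to get the schema the sink actually wrote is to read it straight out of one of the output files (avro-tools getschema does the same job). A minimal sketch, assuming the fastavro package and example file paths:
import json
from fastavro import reader  # assumes fastavro is installed

# Dump the writer schema embedded in a sink-produced Avro file, then point
# 'avro.schema.url' at the dumped .avsc instead of the original schema.
with open("/events/event/part-00000.avro", "rb") as fo:
    writer_schema = reader(fo).writer_schema  # schema as written by the HDFS sink

with open("/path/to/event_from_sink.avsc", "w") as out:
    json.dump(writer_schema, out, indent=2)   # reference this file in avro.schema.url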

Related

Compare two Json files using Apache Spark

I am new to Apache Spark and I am trying to compare two JSON files.
My requirement is to find out which key/value was added, removed or modified, and what its path is.
To explain my problem, I am sharing the code I have tried along with a small JSON sample.
Sample Json 1 is:
{
  "employee": {
    "name": "sonoo",
    "salary": 57000,
    "married": true
  }
}
Sample Json 2 is:
{
  "employee": {
    "name": "sonoo",
    "salary": 58000,
    "married": true
  }
}
My code is:
//Compare two multiline json files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Load first json file
val jsonData_1 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_1.json").values)
//Load second json file
val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_2.json").values)
//Compare both json files
jsonData_2.except(jsonData_1).show(false)
The output which I get on executing this code is:
+--------------------+
|employee |
+--------------------+
|{true, sonoo, 58000}|
+--------------------+
But only one field, i.e. salary, was modified, so the output should contain only the updated field along with its path.
Below are the expected output details:
[
{
"op" : "replace",
"path" : "/employee/salary",
"value" : 58000
}
]
Can anyone point me in the right direction?
Assuming each JSON has an identifier, and that you have two JSON groups (e.g. folders), you need to compare between the JSONs in the two groups:
Load the JSONs from each group into a dataframe, providing a schema matching the structure of the JSON. After this, you have two dataframes.
Compare the JSONs (by now rows in a dataframe) by joining on the identifiers and looking for mismatched values.
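A minimal sketch of that idea in PySpark (the id column, folder paths and flat schema are assumptions; the same approach works in Scala):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: one folder per group, records keyed by an "id" column.
old_df = spark.read.json("D:/group_1")
new_df = spark.read.json("D:/group_2")

# Join on the identifier, then flag every field whose value changed.
fields = [c for c in old_df.columns if c != "id"]
joined = old_df.alias("a").join(new_df.alias("b"), on="id", how="full_outer")
diffs = joined.select(
    "id",
    *[F.when(~F.col("a." + c).eqNullSafe(F.col("b." + c)), F.col("b." + c)).alias("changed_" + c)
      for c in fields]
)
diffs.show(truncate=False)  # non-null changed_* cells are the modified values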

Hive external table read json as textfile

I'm trying to create a Hive external table for a JSON file in .txt format. I have tried several approaches, but I think I'm going wrong in how the Hive external table should be defined:
My Sample JSON is:
[[
{
"user": "ron",
"id": "17110",
"addr": "Some address"
},
{
"user": "harry",
"id": "42230",
"addr": "some other address"
}]]
As you can see, it's an array inside an array. It seems that this is valid JSON returned by an API, although I have read posts saying that JSON should start with a '{'.
Anyway, I am trying to create an external table like this:
CREATE EXTERNAL TABLE db1.user(
array<array<
user:string,
id:string,
desc:string
>>)
PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
This does not work, nor does something like this:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
After trying to modify the JSON text file, replacing [ with { etc., and adding partitions, I still wasn't able to query it using select *. I'm missing a key piece in the table structure.
Can you please help me so that the table can read my JSON correctly?
If required, I can modify the input JSON, if the double [[ is a problem.
1st: Each row of the table should be represented in the file as a single line; no multi-line JSON.
2nd: You can have array<some complex type> as a single column, but this is not convenient, because you will need to explode the array to be able to access the nested elements (see the sketch after this list). The only reason you may want such a structure is when there really are multiple rows with array<array<>>.
3rd: Everything in [] is an array. Everything in {} is a struct or a map; in your case it is a struct, and you have missed this rule. The fields user, id and desc are inside a struct, and the struct is nested inside an array. An array can have only a type in its definition: if it is a nested struct, it is array<struct<...>>; if the array is of a simple type, it is, for example, array<string>.
4th: Your JSON is not valid because it contains an extra comma after the address value; fix it.
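To illustrate the 2nd point, here is a rough sketch (in PySpark rather than HiveQL, and only a sketch) of what accessing the nested elements of such a column involves; the table and column names follow the array<array<struct<...>>> DDL shown just below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

flattened = (
    spark.table("db1.user")                                # table from the DDL below
         .select(explode(col("colname")).alias("inner"))   # array<array<struct>> -> array<struct>
         .select(explode(col("inner")).alias("rec"))       # array<struct> -> struct
         .select("rec.user", "rec.id", "rec.desc")         # struct fields as plain columns
)
flattened.show(truncate=False)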
If you prefer to have a single column colname containing array<array<struct<...>>>, then create the table like this:
CREATE EXTERNAL TABLE db1.user(
colname array<array<
struct<user:string,
id:string,
desc:string>
>>)...
And the JSON file should look like this (a single line for each row):
[[{"user": "ron","id": "17110","addr": "Some address"}, {"user": "harry","id": "42230","addr": "some other address"}]]
If the file contains a single big array nested in another array, it is better to remove [[ and ]], remove the commas between structs, and remove the extra newlines inside structs. If a single row is a struct {}, you can define your table without the outer struct<>; only nested structs need to be defined as struct<>:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)...
Note that in this case you do not need : between the column name and the type. Use : only inside nested structs.
And the JSON should look like this (each JSON object, as defined in the DDL, on a single line; no commas between structs; each struct on a separate line):
{"user": "ron","id": "17110","addr": "Some address"}
{"user": "harry","id": "42230","addr": "some other address"}
Hope you got how it works. Read more in the JSONSerDe manual.

Hive table with nested JSON as string value

I am trying to create a table from nested JSON.
The second layer of the JSON is very complex, and I don't want to keep the schema of that JSON in the table definition as a struct column.
I am looking for a solution that allows me to keep it as a string.
For example:
{
"request_id": "3dbd4ee3-96fc-4342-bd62",
"payload": { < COMPLEX NESTED JSON > },
"timestamp": 1569161622
}
I was trying to use the following create statement:
CREATE EXTERNAL TABLE data (
request_id string,
payload string,
`timestamp` int
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3a://bucket'
Is there any SerDe property/mapping I can use to define the nested object as String?
You can use the org.openx.data.jsonserde.JsonSerDe SerDe.
For more info on this SerDe, refer to https://github.com/rcongiu/Hive-JSON-Serde.
Hope this helps.
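As a side note, once payload is stored as a plain string column, individual fields can still be extracted on demand with get_json_object (available both as a Hive UDF and in Spark). A rough PySpark sketch, where the table name comes from the CREATE TABLE above and the JSON path is just an example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("data").select(
    "request_id",
    F.get_json_object("payload", "$.some.nested.field").alias("nested_field"),  # example path
    "timestamp",
)
df.show(truncate=False)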

Process events from Event hub using pyspark - Databricks

I have a Mongo change stream (a pymongo application) that is continuously getting the changes in collections. The change documents received by the program are sent to Azure Event Hubs. A Spark notebook has to read the documents as they arrive in Event Hub and do schema matching (match the fields in the document with the Spark table columns) against the Spark table for that collection. If there are fewer fields in the document than in the table, the missing columns have to be added as null.
I am reading the events from Event Hub like below.
spark.readStream.format("eventhubs").options(**config).load()
As said in the documentation, the original message is in the 'body' column of the dataframe, which I am converting to a string. Now I have the Mongo document as a JSON string in a streaming dataframe. I am facing the issues below.
I need to extract the individual fields in the Mongo document. This is needed to compare which fields are present in the Spark table and which are not in the Mongo document. I saw a function called get_json_object(col, path). This essentially returns a string again, and I cannot individually select all the columns.
Even if from_json can be used to convert the JSON string to a struct type, I cannot specify the schema, because we have close to 70 collections (with a corresponding number of Spark tables), each sending Mongo docs with anywhere from 10 to 450 fields.
If I could convert the JSON string in the streaming dataframe to a JSON object whose schema can be inferred by the dataframe (something like what read.json can do), I could use the SQL * notation to extract the individual columns, do a few manipulations, and then save the final dataframe to the Spark table. Is it possible to do that? What mistake am I making?
Note: A streaming DF doesn't support the collect() method to individually extract the JSON string from the underlying RDD and do the necessary column comparisons. Using Spark 2.4 & Python in Azure Databricks environment 4.3.
Below is the sample data I get in my notebook after reading the events from event hub and casting it to string.
{
"documentKey": "5ab2cbd747f8b2e33e1f5527",
"collection": "configurations",
"operationType": "replace",
"fullDocument": {
"_id": "5ab2cbd747f8b2e33e1f5527",
"app": "7NOW",
"type": "global",
"version": "1.0",
"country": "US",
"created_date": "2018-02-14T18:34:13.376Z",
"created_by": "Vikram SSS",
"last_modified_date": "2018-07-01T04:00:00.000Z",
"last_modified_by": "Vikram Ganta",
"last_modified_comments": "Added new property in show_banners feature",
"is_active": true,
"configurations": [
{
"feature": "tip",
"properties": [
{
"id": "tip_mode",
"name": "Delivery Tip Mode",
"description": "Tip mode switches the display of tip options between percentage and amount in the customer app",
"options": [
"amount",
"percentage"
],
"default_value": "tip_percentage",
"current_value": "tip_percentage",
"mode": "multiple or single"
},
{
"id": "tip_amount",
"name": "Tip Amounts",
"description": "List of possible tip amount values",
"default_value": 0,
"options": [
{
"display": "No Tip",
"value": 0
}
]
}
]
}
]
}
}
I would like to separate out and extract the fullDocument in the sample above. When I use get_json_object, I get the fullDocument in another streaming dataframe as a JSON string and not as an object. As you can see, there are some array types in fullDocument which I can explode (the documentation says that explode is supported in a streaming DF, but I haven't tried it), but there are also some objects (like struct types) from which I would like to extract the individual fields. I cannot use the SQL '*' notation because what get_json_object returns is a string and not the object itself.
It is convincing that JSON with such varied schemas is better handled with the schema specified explicitly. So my takeaway is that in a streaming environment, with very different schemas in the incoming stream, it is always better to specify the schema. I am therefore proceeding with get_json_object and from_json, reading the schema from a file.
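A minimal sketch of that approach (the file path is hypothetical, and the schema file is assumed to have been produced once with df.schema.json() from a batch read of sample documents for the collection):
import json
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# Load the per-collection schema that was saved to a file earlier.
with open("/dbfs/schemas/configurations.json") as fh:
    schema = StructType.fromJson(json.load(fh))

raw = (
    spark.readStream.format("eventhubs")
         .options(**config)                                  # same Event Hubs config as above
         .load()
         .withColumn("body", F.col("body").cast("string"))
)

# Parse the JSON string with the explicit schema, then expose individual fields.
parsed = raw.select(F.from_json("body", schema).alias("doc")).select("doc.*")
# parsed now has real columns (documentKey, fullDocument, ...) to compare against the target table.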

Loading JSON file into BigQuery table: Schema changes

I am trying to load a JSON file into a BQ table. My data looks something like:
{"_eventid": "1234", "Keywords":""}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
From above, you can see that the schema changes for "Keywords." How do I deal with this? Using something like:
{
  "name": "Keywords",
  "type": "record",
  "mode": "nullable",
  "fields": [
    {
      "name": "_text",
      "type": "string",
      "mode": "nullable"
    }
  ]
},
Only works for the second entry. For the first, I get the error:
Errors:
file-00000000: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. (error code: invalid)
JSON parsing error in row starting at position 0 at file: file-00000000. Flat value specified for record field. Field: Keywords; Value: (error code: invalid)
Short Answer
A BigQuery table is schema-bound. Whenever we try to ingest data that does not conform to the table schema, we get an error. In your case, in the first record the value of Keywords is a string, but in the schema it is a record with one nullable field whose name is _text.
Workaround
You need to preprocess the data before loading it into BigQuery. If you have a small JSON file, you can write a script that checks whether the type of Keywords is a record or a string, and if it is a string, creates the record first (see the sketch after these examples). After preprocessing, the file content would look like:
{"_eventid": "1234", "Keywords":{"_text": ""}}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
According to your schema, Keywords is a nullable record. You could even drop Keywords entries whose value is empty during preprocessing. After this step the input file would become:
{"_eventid": "1234"}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
BigQuery now supports schema changes on load with
--schema_update_option=ALLOW_FIELD_ADDITION
--schema_update_option=ALLOW_FIELD_RELAXATION
options. See How to insert/append unstructured data to bigquery table for more details and examples with JSON loading.
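If you load with the Python client instead of the bq CLI, the equivalent options look roughly like this (the dataset, table and file names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

with open("data.json", "rb") as fh:
    job = client.load_table_from_file(fh, "my_dataset.my_table", job_config=job_config)
job.result()  # wait for the load job and surface any errors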