Hive external table: read JSON as TEXTFILE

I'm trying to create a Hive external table for a JSON file in .txt format. I have tried several approaches, but I think I'm going wrong in how the external table should be defined:
My Sample JSON is:
[[
{
"user": "ron",
"id": "17110",
"addr": "Some address"
},
{
"user": "harry",
"id": "42230",
"addr": "some other address"
}]]
As you can see, it's an array inside an array. This seems to be valid JSON returned by an API, although I have read posts saying that JSON should start with a '{'.
Anyway, I am trying to create an external table like this:
CREATE EXTERNAL TABLE db1.user(
array<array<
user:string,
id:string,
desc:string
>>)
PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
This does not work. Nor does something like this:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)PARTITIONED BY(date string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/tmp/data/addr'
After trying to modify the JSON text file, replacing [ with { etc., and adding a partition, I still wasn't able to query it using SELECT *. I'm missing a key piece in the table structure.
Can you please help me so that the table can read my JSON correctly?
If required, I can modify the input JSON if the double [[ is a problem.

1st: Each row in a table should be represented in the file as a single line; no multi-line JSON.
2nd: You can have array<some complex type> as a single column, but this is not convenient, because you will need to explode the array (twice in this case; see the query sketch after the example below) to be able to access the nested elements. The only reason you may want such a structure is when there really are multiple rows with array<array<>>.
3rd: Everything in [] is an array. Everything in {} is a struct or a map; in your case it is a struct, and this is the rule you missed. The fields user, id and desc are inside a struct, and the struct is nested inside the arrays. An array can have only a type in its definition: if the element is a nested struct, it will be array<struct<...>>; if the array is of a simple type, then, for example, array<string>.
4th: Your JSON is not valid because it contains an extra comma after the address value; fix it.
If you prefer to have a single column colname containing array<array<struct<...>>>, then create the table like this:
CREATE EXTERNAL TABLE db1.user(
colname array<array<
struct<user:string,
id:string,
desc:string>
>>)...
And the JSON file should look like this (a single line for each row):
[[{"user": "ron","id": "17110","addr": "Some address"}, {"user": "harry","id": "42230","addr": "some other address"}]]
If the file contains a single big array nested in another array, it is better to remove the [[ and ]], remove the commas between structs, and remove the extra newlines inside structs. If a single row is a struct {}, you can define your table without an outer struct<>; only nested structs should be declared as struct<>:
CREATE EXTERNAL TABLE db1.user(
user string,
id string,
desc string
)...
Note that in this case you do not need a : between the column name and its type; use : only inside nested structs.
And the JSON should look like this (each whole JSON object, as defined in the DDL, on a single line; no commas between structs; each struct on a separate line):
{"user": "ron","id": "17110","addr": "Some address"}
{"user": "harry","id": "42230","addr": "some other address"}
Hope this shows how it works. Read more in the JsonSerDe manual.

Related

To handle dynamic fields at the Scala end

The scenario is as follows: we are pulling data from MongoDB in JSON format and processing it through Spark. At times we do not get the desired field inside a complex datatype, e.g. a nested array of strings, or a struct within an array.
1. Is there any workaround, while loading the JSON file, to put null values into the absent fields (validator checks)?
2. If we want to handle this dynamic nature at the Scala end, how should that be done?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// For each expected column missing from the DataFrame, add it as a typed
// null column. Note: lit("null") would add the literal string "null",
// not a null value, so we use lit(null) with an explicit cast instead.
def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame = {
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    if (!df.columns.contains(column))
      df.withColumn(column, lit(null).cast(StringType))
    else
      df
  }
}
I am using the above code to verify whether the required columns are present on the source side, compared with the list of expected column names; if a column is not present, it is added with null values.
The question here is how to get a complex data type, like an array of structs, into a general column name so that I can compare it.
(I can use the dot operator to pull a column out of a struct, but if that column doesn't exist my script will fail.)
Take a look at the Scala Option class. Let's assume you have a
case class JsonTemplate(optionalArray: Option[Seq[String]])
and that you receive the valid JSON {}. The parser will put None as the value, and you will get the instance JsonTemplate(None).
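For the schema side of it, i.e. checking whether a nested field such as an array of structs is present before selecting it with the dot operator, one option is to walk the DataFrame schema. A minimal sketch against Spark's StructType API (the dot-separated path convention and the names in the example are assumptions of this sketch):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Returns true if a dot-separated path such as "payload.items.name"
// exists in the schema, descending through structs and array element types.
def hasNestedColumn(df: DataFrame, path: String): Boolean = {
  def descend(dt: DataType, parts: List[String]): Boolean = parts match {
    case Nil => true
    case head :: tail =>
      dt match {
        case st: StructType =>
          st.fields.find(_.name == head).exists(f => descend(f.dataType, tail))
        case at: ArrayType =>
          descend(at.elementType, parts)
        case _ =>
          false
      }
  }
  descend(df.schema, path.split("\\.").toList)
}

For instance, hasNestedColumn(df, "payload.items.name") (hypothetical names) can guard the corresponding select or withColumn call so the script does not fail on an absent field.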

Split JSON into two individual JSON objects using NiFi

I have a JSON like
{
"campaign_key": 316,
"client_key": 127,
"cpn_mid_counter": "24",
"cpn_name": "Bopal",
"cpn_status": "Active",
"clt_name": "Bopal Ventures",
"clt_status": "Active"
}
Expected output
1st JSON :
{
"campaign_key": 316,
"client_key": 127,
"cpn_mid_counter": "24",
"cpn_name": "Bopal",
"cpn_status": "Active"
}
2nd JSON:
{
"clt_name": "Bopal Ventures",
"clt_status": "Active"
}
How do I achieve this using NiFi? Thanks.
You can do what the other answer suggests. The not-so-good thing about that approach is that, as the number of fields grows, you are required to add that many JsonPath expression attributes to EvaluateJsonPath and subsequently add that many attributes in ReplaceText.
Instead, what I'm proposing is to use QueryRecord with the Record Reader set to JsonTreeReader and the Record Writer set to JsonRecordSetWriter, and to add two dynamic relationship properties as follows:
json1 : SELECT campaign_key, client_key, cpn_mid_counter, cpn_name, cpn_status FROM FLOWFILE
json2 : SELECT clt_name, clt_status FROM FLOWFILE
This approach takes care of reading and writing the output in JSON format. Plus, if you want to add more fields, you just have to add the field names to the SQL SELECT statements.
The QueryRecord processor lets you execute SQL queries against the FlowFile content; more details can be found in its documentation.
Karthik,
Use the EvaluateJsonPath processor to extract all the JSON values by their keys.
Example: $.campaign_key gets the campaign key value, and $.clt_name gets the client name.
In the same way you can extract all the values.
Then use the ReplaceText processor to turn the single JSON into two JSONs:
{"campaign_key":${campaign_key},...etc}
{"clt_name":${clt_name}}
It will convert the single JSON into two JSONs.
Hope this is helpful; let me know if you have issues.

Loading JSON file into BigQuery table: Schema changes

I am trying to load a JSON file into a BigQuery table. My data looks something like:
{"_eventid": "1234", "Keywords":""}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
From the above, you can see that the schema changes for "Keywords". How do I deal with this? Using something like:
{
"name":"Keywords",
"type":"record",
"mode":"nullable",
"fields": [
{
"name":"_text",
"type":"string",
"mode":"nullable"
}
]
},
Only works for the second entry. For the first, I get the error:
Errors:
file-00000000: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. (error code: invalid)
JSON parsing error in row starting at position 0 at file: file-00000000. Flat value specified for record field. Field: Keywords; Value: (error code: invalid)
Short Answer
BigQuery tables are schema-bound: whenever we try to ingest data that does not conform to the table schema, we get an error. In your case, in the first record, the value of Keywords is a string, but in the schema it is a record with one nullable field named _text.
Workaround
You need to preprocess the data before loading it into BigQuery. If you have a small JSON file, you can write a script (a sketch follows the examples below) that checks whether Keywords is a record or a string and, if it is a string, builds the record first. After preprocessing, the file content would look like:
{"_eventid": "1234", "Keywords":{"_text": ""}}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
According to your schema, Keywords is a nullable record, so you can even drop Keywords entries whose value is empty during preprocessing. After this step the input file would become:
{"_eventid": "1234"}
{"_eventid": "4567", "Keywords":{"_text":"abcd"} }
BigQuery now supports schema changes on load with
--schema_update_option=ALLOW_FIELD_ADDITION
--schema_update_option=ALLOW_FIELD_RELAXATION
options. See How to insert/append unstructured data to bigquery table for more details and examples with JSON loading.
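For illustration, a load command using those options might look like the following sketch (the dataset, table, and file names are hypothetical):

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  mydataset.mytable ./data.json ./schema.json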

Merging dynamic field with type date/string triggered conflict

I'm uploading JSON files to my Elasticsearch server, and I have an object "meta" with a field name and a field value. Sometimes value is a string and sometimes it is a date, so the dynamic mapping doesn't work.
I tried to put an explicit mapping to set the field to string, but I always get the same error: "Merging dynamic updates triggered a conflict: mapper [customer.meta.value] of different type, current_type [string], merged_type [date]".
Can I use the "ignore_conflict" parameter, or how else can I index a multi-type field?
Thanks
You cannot have two data types for the same field in Elasticsearch; it is not possible to index it. Dynamic mapping means that the type is inferred from the first value inserted into the field; if you try to insert some other type into that field, it will be an error. If you need to store both strings and dates, your best bet is to set the mapping to string and explicitly convert your dates to strings before passing them to Elasticsearch.
I disabled date_detection for _default_ and that's working.
Now my problem is the following: I want to disable date_detection only for meta.value and customer.meta.value. It works for the first, but I can't get it to work for the second, because it's a nested object, I think.
I tried this:
curl -XPUT 'localhost:9200/rr_sa' -d '
{
"mappings": {
"meta": {
"date_detection": false
},
"customer.meta": {
"date_detection": false
}
}
}
'
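If date_detection cannot be applied to the nested object, another route, in line with the first answer above, is to map the problematic field explicitly as a string so that dynamic date detection never runs on it. A sketch only, assuming the mapping type is named customer (adjust to the actual type name; "string" matches the pre-5.x type shown in the error message):

curl -XPUT 'localhost:9200/rr_sa' -d '
{
  "mappings": {
    "customer": {
      "properties": {
        "meta": {
          "properties": {
            "value": { "type": "string" }
          }
        }
      }
    }
  }
}'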

Get value of JSON string

[
{
"type": "spline",
"name": "W dor\u0119czeniu",
"color": "rgba(128,179,236,1)",
"mystring": 599,
"data": ...
}
]
I am trying to access this JSON as json['W doręczeniu']['mysting'], and I get no value. Why is that?
You're trying to access the index "W doręczeniu", but that's not an index, it's a value. Also, what you seem to have is an array of JSON objects.
The [ at the start marks the array, the first element of which is your JSON object. The JSON object begins with the {.
You're also trying to use [ ] with a value rather than an array index; object properties are normally accessed with the dot operator (or bracket notation with the property name).
I'm not sure which index you're actually trying to access, but try something like this:
var x = json[0].mystring;
"W doręczeniu" is a value, not a key, so you cannot use it to get a value. Since your JSON is an array, you'll have to do json[0].name to access the name of the first (and only) element in the array, which happens to be the object. Of course, this is assuming json is the variable you store the array in.
var json = [{"type":"spline","name":"W dor\u0119czeniu","color":"rgba(128,179,236,1)","mystring":599}];
console.log(json[0].mystring); //should give you what you want.
EDIT:
To get the last element in a JS array, you can simply do this:
console.log( json[json.length -1].mystring ); // same output as the previous example
'length - 1' because JS arrays are indexed from 0. There are probably a million and one ways to dynamically get the array element you want, which are out of the scope of this question.
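That said, since the original attempt was to look the object up by its name value rather than by position, Array.prototype.find is one way to do that dynamically (assuming json is the array variable from the examples above):

var item = json.find(function (o) { return o.name === "W dor\u0119czeniu"; });
console.log(item ? item.mystring : undefined); // 599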