NiFi, flow with KafkaConsumer to write as json - json

Currently I am stuck on the following problem:
I am reading messages from a Kafka Topic using KafkaConsumer. The messages are strings and have the following format:
{ "a" : "b", "a1" : "b1", "c2" : "c3" }
They are saved within the payload of the FlowFile.
I want to convert that string into JSON or, ideally, CSV, but can't figure out how to do it.
I am new to NiFi and researched as much as possible, but the answers I found covered conversions from JSON to Avro or similar, never string to JSON or Avro.
I also found out that the Kafka message is in the payload of the FlowFile, not in the attributes, so I have no clue how to get my hands on it, since the examples always involve the attributes.
So in short: can I convert the payload of a FlowFile, which is a string, to JSON/CSV with some built-in processor?

If your message is in the FlowFile, the following sequence could help:
1) Use AttributesToJSON to convert the payload message to JSON.
2) Use EvaluateJsonPath to extract the payload message (in your case, the Kafka message). Then you can pass the extracted messages on for CSV generation.
This post can help with converting JSON to CSV: Convert Json To CSV

I ended up doing this:
ConsumeKafka gives me the string:
{ "a" : "b", "a1" : "b1" }
EvaluateJsonPath creates attributes by adding properties
a -> $.a //results in attribute named a with value b
a1 -> $.a1 //results in attribute named a1 with value b1
ReplaceText uses the attributes from EvaluateJsonPath to form a single CSV-formatted line:
Replacement value -> ${'a'},${'a1'}
This results in a single line, BUT NO NEW LINE:
b,b1
To add the new line, appending \n, '\n', or "\n" did not work.
What worked was pressing Shift+Enter while typing into the Replacement value field, which resulted in creating an empty new line.
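To check the intended result outside NiFi, the same steps (parse the payload, pick the fields, join them with commas, end with a newline) can be sketched in plain Python. This is only an illustration under assumptions: the function name payload_to_csv_line and the field list are hypothetical, and the code says nothing about how the NiFi processors work internally.

```python
import csv
import io
import json


def payload_to_csv_line(payload, fields):
    """Parse a JSON object string and return one CSV line ending in a newline.

    `payload_to_csv_line` and `fields` are hypothetical names for illustration.
    """
    record = json.loads(payload)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Missing keys become empty cells instead of raising KeyError.
    writer.writerow([record.get(name, "") for name in fields])
    return buffer.getvalue()


line = payload_to_csv_line('{ "a" : "b", "a1" : "b1" }', ["a", "a1"])
```

Using the csv module rather than a plain join means values that themselves contain commas or quotes are quoted correctly.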

Related

Extracting a JSON out of a string using JSONPath

I have Json data as follows:
{
"template" : [
"{
"Id": "abc"
}"
]
}
I am using JSONPath to extract data from the Json above. I would like to extract the "Id" data from the Json using JsonPath.
The problem I see is, the data is being treated as a string and not as a Json as shown below.
"{
"Id": "abc"
}"
If there were no double-quotes I could have used JsonPath as follows:
$.template[0].Id
But due to the double-quotes, I am unable to access the "Id" data. I suspect there is a way to access this data using JsonPath-Expression but I am pretty much a novice here and no amount of research helped me out with a resolution.
How do I treat it as a Json and not as a string using JsonPath? Kindly help me out here.
JSON Path isn't going to be able to parse JSON that's encoded within a string. You need to perform three operations:
Get the string (use JSON Path or something else)
Parse the string as JSON.
Get the data you're looking for on that (JSON Path or something else)
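A minimal Python sketch of those three operations, assuming the inner JSON is stored as a properly escaped string inside the "template" array (the variable names are illustrative only):

```python
import json

# Build a document whose "template" array holds JSON encoded as a string.
inner = json.dumps({"Id": "abc"})
raw = json.dumps({"template": [inner]})

outer = json.loads(raw)          # parse the outer document
embedded = outer["template"][0]  # 1. get the string
parsed = json.loads(embedded)    # 2. parse the string as JSON
value = parsed["Id"]             # 3. get the data you are looking for
```

The second json.loads call is the step a single JSONPath expression cannot perform on its own.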

Processing JSON from a .txt file and converting to a DataFrame in Julia

Cross posting from Julia Discourse in case anyone here has any leads.
I’m just looking for some insight into why the below code is returning a dataframe containing just the first line of my json file. If you’d like to try working with the file I’m working with, you can download the aminer_papers_0.zip from the Microsoft Open Academic Graph site, I’m using the first file in that group of files.
using JSON3, DataFrames, CSV
file_name = "path/aminer_papers_0.txt"
json_string = read(file_name, String)
js = JSON3.read(json_string)
df = DataFrame([js])
The resulting DataFrame has just one line, but the column titles are correct, as is the first line. To me the mystery is why the rest isn’t getting processed. I think I can rule out that read() is only reading the first JSON object, because I can index into the resulting object and see many JSON objects.
My first guess was maybe the newline \n was causing escape issues, and tried to use chomp to get rid of them, but couldn’t get it to work.
Anyway - any help would be greatly appreciated!
I think the problem is that the file is in JSON Lines format, and the JSON3 library only returns the first valid JSON value that it finds at the start of a string unless told otherwise.
tl;dr
Call JSON3.read with the keyword argument jsonlines=true.
Why?
By default, JSON3 interprets a string passed to its read function as a single "JSON text", defined by RFC 8259 section 2:
A JSON text is a serialized value....
(My emphasis on the use of the indefinite singular article "a.") A "JSON value" is defined in section 3:
A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false, null, true.
A string with multiple JSON values in it is technically multiple "JSON texts." It is up to the parser to determine what part of the string argument you give it is a JSON text, and the authors of JSON3 chose as the default behavior to parse from the start of the string to the end of the first valid JSON value.
In order to get JSON3 to read the string as multiple JSON values, you have to give it the keyword option jsonlines=true, which is documented as:
jsonlines: A Bool indicating that the json_str contains newline delimited JSON strings, which will be read into a JSON3.Array of the JSON values. See jsonlines for reference. [default false]
Example
Take for example this simple string:
two_values = "3.14\n2.72"
Each one of these lines is a valid JSON serialization of a number. However, when passed to JSON3.read, only the first is parsed:
using JSON3
@assert JSON3.read(two_values) == 3.14
Using jsonlines=true, both values are parsed and returned as a JSON3.Array struct:
@assert JSON3.read(two_values, jsonlines=true) == [3.14, 2.72]
Other Packages
The JSON.jl library, which people might use by default given the name, does not implement parsing of JSON Lines strings at all, leaving it up to the caller to properly split the string as needed:
using JSON
JSON.parse(two_values)
# ERROR: Expected end of input
# Line: 1
# Around: ...3.14 2.72...
# ^
A simple way to implement reading multiple values is to use eachline:
@assert [JSON.parse(line) for line in eachline(IOBuffer(two_values))] == [3.14, 2.72]
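For comparison, the same line-by-line approach can be sketched with Python's standard json module, which also expects exactly one JSON text per call. This is an illustration of the technique only and does not describe JSON3 or JSON.jl:

```python
import json

two_values = "3.14\n2.72"

# json.loads expects exactly one JSON text, so the whole string is rejected.
try:
    json.loads(two_values)
    whole_string_ok = True
except json.JSONDecodeError:
    whole_string_ok = False

# Parse each non-empty line on its own instead.
values = [json.loads(line) for line in two_values.splitlines() if line.strip()]
```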

escaping dots in json evaluation in WSO2 Datamapper

I have a JSON object as payload, which contains a dot (".") in one of the identifier names and I want to map this object to another JSON object using the datamapper mediator.
The problem I am facing is that the JSON evaluation uses the dot notation for nested elements. The field "example":
{ "a": { "b": "example"} }
is evaluated by asking for a.b
My object however looks like:
{ "a": { "b.c": "example"} }
I cannot evaluate a.b.c, because it thinks b and c are two separate nested elements.
Escaping this identifier name in the datamapper.dmc javascript code does not seem to work. No matter what I try ('', "", [''], [""]) I get the error:
Error while reading input stream. Script engine unable to execute the script javax.script.ScriptException: <eval>:8:43 Expected ident but found [
This may not be the exact solution, as I did not specifically try it with the Data Mapper, but I had a similar problem in the WSO2 Property Mediator while parsing incoming JSON to get a value and set it as a property. I was able to parse such JSON using the following syntax:
json-eval($.A.['b.c'])
Where 'A' is JSON object containing 'b.c' JSON element.
I saw you mentioned that you already tried something similar, but just wanted to give my working example in case it helps.
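The underlying idea (address the key as one quoted unit instead of splitting the path on dots) can be sketched outside WSO2 in plain Python. The helper get_path is hypothetical and has nothing to do with the Data Mapper or json-eval implementation:

```python
import json
from functools import reduce

payload = json.loads('{ "a": { "b.c": "example" } }')


def get_path(obj, keys):
    """Follow keys one by one; each key is used verbatim, dots included."""
    return reduce(lambda node, key: node[key], keys, obj)


# Keys passed as a list: "b.c" stays a single key.
value = get_path(payload, ["a", "b.c"])

# Splitting a dotted path treats "b" and "c" as separate levels and fails.
try:
    get_path(payload, "a.b.c".split("."))
    naive_lookup_ok = True
except KeyError:
    naive_lookup_ok = False
```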

How can I validate Json schema in spark 2.X?

I am using Spark Streaming (written in Scala) to read messages from Kafka.
The messages are all strings in JSON format.
I define the expected schema in a local variable expectedSchema
and then parse the strings in the RDD to JSON:
spark.sqlContext.read.schema(expectedSchema).json(rdd.toDS())
The problem: Spark will process all the records/rows as long as they have some of the fields that I try to read, even if the actual JSON format (i.e. schema) of the input row (string) doesn't match my expectedSchema.
Assume expected schema looks like this (in Json): {"a": 1,"b": 2, "c": 3}
and input row looks like this: {"a": 1, "c": 3}
Spark will process the input without failing.
I tried using the solution described here: How do I apply schema with nullable = false to json reading
but assert(readJson.schema == expectedSchema) never fails, even when I deliberately send input rows with wrong Json schema.
Is there a way for me to verify that the actual schema of a given input row, matches my expected schema?
Is there a way for me to insert a null value to "fill" fields missing from "corrupt" schema row?
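Independent of Spark, the per-row check being asked for can be sketched in plain Python: compare the keys of each parsed row with the expected field names, report what is missing, and fill the gaps with null. The names EXPECTED_FIELDS and check_row are hypothetical, and this sketch does not use or describe any Spark API:

```python
import json

EXPECTED_FIELDS = ("a", "b", "c")  # hypothetical expected schema


def check_row(raw):
    """Return (row_with_nulls, missing_fields) for one JSON string."""
    row = json.loads(raw)
    missing = [name for name in EXPECTED_FIELDS if name not in row]
    # dict.get returns None for absent keys, which serializes to JSON null.
    filled = {name: row.get(name) for name in EXPECTED_FIELDS}
    return filled, missing


filled, missing = check_row('{"a": 1, "c": 3}')
```

A non-empty missing list marks the row as not matching the expected schema, while filled is the same row with the absent fields set to null.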

Parsing large JSON file with Scala and JSON4S

I'm working with Scala in IntelliJ IDEA 15 and trying to parse a large twitter record json file and count the total number of hashtags. I am very new to Scala and the idea of functional programming. Each line in the json file is a json object (representing a tweet). Each line in the file starts like so:
{"in_reply_to_status_id":null,"text":"To my followers sorry..
{"in_reply_to_status_id":null,"text":"#victory","in_reply_to_screen_name"..
{"in_reply_to_status_id":null,"text":"I'm so full I can't move"..
I am most interested in a property called "entities" which contains a property called "hashtags" with a list of hashtags. Here is an example:
"entities":{"hashtags":[{"text":"thewayiseeit","indices":[0,13]}],"user_mentions":[],"urls":[]},
I've browsed the various scala frameworks for parsing json and have decided to use json4s. I have the following code in my Scala script.
import org.json4s.native.JsonMethods._
var json: String = ""
for (line <- io.Source.fromFile("twitter38.json").getLines) json += line
val data = parse(json)
My logic here is that I am trying to read each line from twitter38.json into a string and then parse the entire string with parse(). The parse function is throwing an error claiming:
"Type mismatch, expected: Nothing, found:String."
I have seen examples that use parse() on strings that hold json objects such as
val jsontest =
"""{
|"name" : "bob",
|"age" : "50",
|"gender" : "male"
|}
""".stripMargin
val data = parse(jsontest)
but I have received the same error. I am coming from an object oriented programming background, is there something fundamentally wrong with the way I am approaching this problem?
You have most likely imported the dependencies into your IntelliJ project, or the modules into your file, incorrectly. Make sure you have the following line in your file:
import org.json4s.native.JsonMethods._
Even if you correctly import this module, parse(json: String) will not work for you, because your JSON is malformed. Your JSON string will look like this:
"""{"in_reply_...":"someValue1"}{"in_reply_...":"someValues2"}"""
but should look as follows to be a valid json that can be parsed:
"""{{"in_reply_...":"someValue1"},{"in_reply_...":"someValues2"}}"""
i.e. you need starting and ending brackets for the JSON, and a comma between each line of tweets. Please read the json4s documentation for more information.
Although it is almost 6 years old, I think this question deserves another try.
There are a few common misunderstandings about JSON, especially about how documents are stored and how they are read back.
JSON documents are stored either as a single object containing all the other fields, or as an array of multiple objects, possibly in the same format. This second part is important because arrays in almost every programming language are written with square brackets and values separated by commas (note that here I use a person object as each value):
[
{"name":"John","surname":"Doe"},
{"name":"Jane","surname":"Doe"}
]
Also note that everything except brackets, numbers, booleans, and null is enclosed in quotes when written to a file.
However, there is another form, not part of the official JSON specification but popular for transferring datasets easily, where every object (or document, in NoSQL/MongoDB terms) is stored on its own line, like this:
{"name":"John","surname":"Doe"}
{"name":"Jane","surname":"Doe"}
So, for the question: the OP has a file written in this second form but uses an algorithm written to read the first form. The following code makes a few simple changes to handle this; the user must know which form the file is in before reading it:
var json: String = "["
for (line <- io.Source.fromFile("twitter38.json").getLines) json += line + ","
json=json.splitAt(json.length()-1)._1
json+= "]"
val data = parse(json)
PS: although @sbrannon has the correct idea, the example he/she gave mistakenly uses curly braces instead of square brackets to surround the data.
EDIT: I have added json=json.splitAt(json.length()-1)._1 because the string built by the loop ends with a trailing comma, which would cause a parse error per the JSON format definition.
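An alternative to rebuilding one big array is to parse each line on its own and count as you go, which avoids holding the whole file in a single string. The sketch below shows that idea in Python rather than Scala/json4s; the sample tweets are shortened, made-up data and the function name count_hashtags is hypothetical:

```python
import json
from collections import Counter

# Shortened, made-up tweets in the one-object-per-line form.
sample_lines = [
    '{"text": "#victory", "entities": {"hashtags": [{"text": "victory", "indices": [0, 8]}]}}',
    '{"text": "no tags here", "entities": {"hashtags": []}}',
    '{"text": "#victory again", "entities": {"hashtags": [{"text": "victory", "indices": [0, 8]}]}}',
]


def count_hashtags(lines):
    """Parse one tweet per line and tally hashtag texts."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"]] += 1
    return counts


counts = count_hashtags(sample_lines)
total = sum(counts.values())
```

Because count_hashtags accepts any iterable of lines, an open file object could be passed in place of sample_lines.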