How to properly merge multiple FlowFile's? - json

I use MergeContent 1.3.0 in order to merge FlowFiles from 2 sources: 1) from ListenHTTP and 2) from QueryElasticsearchHTTP.
The problem is that the merging result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing data structure in Elastic, I was able to come up with the following output result of MergeContent. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which operator should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge files. However I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class ModJSON(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
obj = json.loads(text)
newObj = {
"eid": obj['eid'],
"zid": obj['zid'],
...
}
outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))
flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 != None && flowFile2 != None):
# WHAT SHOULD I PUT HERE??
flowFile = session.write(flowFile, ModJSON())
flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0]+'_translated.json')
session.transfer(flowFile, REL_SUCCESS)
session.commit()

The example how to read multiple files from incoming queue using filtering
Assume you have multiple pairs of flow files with following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of eid field provides a link between pairs.
Before merging we have to extract the value of eid field and put it into na attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with properties:
Destination : flowfile-attribute
eid : $.eid
After this you'll have new eid attribute of the flow file.
Then use ExecuteScript processor with groovy language and with following code:
import org.apache.nifi.processor.FlowFileFilter;
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
//get first flow file
def ff0 = session.get()
if(!ff0)return
def eid = ff0.getAttribute('eid')
//try to find files with same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
public FlowFileFilterResult filter(FlowFile ff) {
if( eid == ff.getAttribute('eid') )return FlowFileFilterResult.ACCEPT_AND_CONTINUE
return FlowFileFilterResult.REJECT_AND_CONTINUE
}
})
//let's assume you require two additional files in queue with the same attribute
if( !ffList || ffList.size()<1 ){
//if less than required
//rollback current session with penalize retrieved files so they will go to the end of the incoming queue
//with pre-configured penalty delay (default 30sec)
session.rollback(true)
return
}
//let's put all in one list to simplify later iterations
ffList.add(ff0)
if( ffList.size()>2 ){
//for example unexpected situation. you have more files then expected
//redirect all of them to failure
session.transfer(ffList, REL_FAILURE)
return
}
//create empty map (aka json object)
def json = [:]
//iterate through files parse and merge attributes
ffList.each{ff->
session.read(ff).withStream{rawIn->
def fjson = new JsonSlurper().parse(rawIn)
json.putAll(fjson)
}
}
//create new flow file and write merged json as a content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
rawOut.withWriter("UTF-8"){writer->
new JsonBuilder(json).writeTo(writer)
}
} as OutputStreamCallback )
//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")
session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)

Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.

Related

Read json file in jsr223 sampler in jmeter and extract data

import com.jayway.jsonpath.JsonPath
def idCSV = new File('id.csv')
def index = [fileOne.json, fileTwo.json]
def jsonString
index.each { file ->
jsonString = ________
def ids = JsonPath.read(jsonString, '$..id')
ids.each { id ->
idCSV << id << newLine
}
}
How to fill the jsonString = ____, so that I can json file into string and parse the string to extract ids and some information from the json string.
And I don't to do it in http request-> GET-> file format.
Previously i have extraced jsonString from http response and it worked well now I want to do it this way.
Use JsonSlurper:
def jsonString = new groovy.json.JsonSlurper().parseText(new File("json.txt").text)
My expectation is that you're looking for File.getText() function
jsonString = file.text
I have no full vision why do you need to store the values from JSON in a CSV file, however there is an alternative way of achieving this which doesn't require scripting as your approach will work with 1 concurrent thread only, if you will add more users attempting writing into the same file - you'll run into a race condition :
You can read the files from the folder into JMeter Variables via Directory Listing Config
The file can be read using HTTP Request sampler
The values cane be fetched using JSON Extractor, they will be automatically stored into JMeter Variables so you will able to use them later on
If you need the values to be present in the file (although I wouldn't recommend this approach cause it will cause massive disk IO and potentially can run your test) you can go for the Flexible File Writer

GCP Proto Datastore encode JsonProperty in base64

I store a blob of Json in the datastore using JsonProperty.
I don't know the structure of the json data.
I am using endpoints proto datastore in order to retrieve my data.
The probleme is the json property is encoded in base64 and I want a plain json object.
For the example, the json data will be:
{
first: 1,
second: 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel
class Model(EndpointsModel):
data = ndb.JsonProperty()
#endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):
#Model.method(path='mymodel2', http_method='POST',
name='mymodel.insert')
def MyModelInsert(self, my_model):
my_model.data = {"first": 1, "second": 2}
my_model.put()
return my_model
#Model.method(path='mymodel/{entityKey}',
http_method='GET',
name='mymodel.get')
def getMyModel(self, model):
print(model.data)
return model
API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
"data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoded of {"second": 2, "first": 1}
And the print statement give me: {u'second': 2, u'first': 1}
So, in the method, I can explore the json blob data as a python dict.
But, in the api call, the data is encoded in base64.
I expeted the api call to give me:
{
'data': {
'second': 2,
'first': 1
}
}
How can I get this result?
After the discussion in the comments of your question, let me share with you a sample code that you can use in order to store a JSON object in Datastore (it will be stored as a string), and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood correctly your issue, and this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel
class Sample(EndpointsModel):
column1 = ndb.StringProperty()
column2 = ndb.IntegerProperty()
column3 = ndb.StringProperty()
#endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):
# URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
#Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
def MyModelInsert(self, my_model):
dict={'first':1, 'second':2}
dict_str=str(dict)
my_model.column1="Year"
my_model.column2=2018
my_model.column3=dict_str
my_model.put()
return my_model
# URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
#Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
if not my_model.from_datastore:
raise endpoints.NotFoundException('MyModel not found.')
dict=eval(my_model.column3)
print("This is the Python dict recovered from a string: {}".format(dict))
return my_model
application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new Entity which you will be able to find in Datastore, and which contains a property column3 with your JSON data in string format:
Then, if you use the ID of that entity to retrieve it, in your browser it will show the string without any strange encoding, just plain JSON:
And in the console, you will be able to see that this string can be converted to a Python dict (or also a JSON, using the json module if you prefer):
I hope I have not missed any point of what you want to achieve, but I think all the most important points are covered with this code: a property being a JSON object, store it in Datastore, retrieve it in a readable format, and being able to use it again as JSON/dict.
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
#Define the nested model (your JSON object)
class Structured(EndpointsModel):
first = ndb.IntegerProperty()
second = ndb.IntegerProperty()
#Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
column1 = ndb.StringProperty()
column2 = ndb.IntegerProperty()
column3 = ndb.StringProperty()
column4 = ndb.StructuredProperty(Structured)
#Modify this endpoint definition to add a new property
#Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
if not my_model.from_datastore:
raise endpoints.NotFoundException('MyModel not found.')
#Add the new nested property here
dict=eval(my_model.column3)
my_model.column4=dict
print(json.dumps(my_model.column3))
print("This is the Python dict recovered from a string: {}".format(dict))
return my_model
With these changes, the response of the call to the endpoint looks like:
Now column4 is a JSON object itself (although it is not printed in-line, I do not think that should be a problem.
I hope this helps too. If this is not the exact behavior you want, maybe should play around with the Property Types available, but I do not think there is one type to which you can print a Python dict (or JSON object) without previously converting it to a String.

Import CSV into google cloud datastore

I have a CSV file with 2 columns and 20,000 rows I would like to import into Google Cloud Datastore. I'm new to the Google Cloud and NoSQL databases. I have tried using dataflow but need to provide a Javascript UDF function name. Does anyone have an example of this? I will be querying this data once it's in the datastore.
Any advice or guidance on how to create this would be appreciated.
Using Apache Beam, you can read a CSV file using the TextIO class. See the TextIO documentation.
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"));
Next, apply a transform that will parse each row in the CSV file and return an Entity object. Depending on how you want to store each row, construct the appropriate Entity object. This page has an example of how to create an Entity object.
.apply(ParDo.of(new DoFn<String, Entity>() {
#ProcessElement
public void processElement(ProcessContext c) {
String row = c.element();
// TODO: parse row (split) and construct Entity object
Entity entity = ...
c.output(entity);
}
}));
Lastly, write the Entity objects to Cloud Datastore. See the DatastoreIO documentation.
.apply(DatastoreIO.v1().write().withProjectId(projectId));
Simple in python, but can easily adapt to other langauges. Use the split() method to loop through the lines and comma-separated values:
from google.appengine.api import urlfetch
from my.models import MyModel
csv_string = 'http://someplace.com/myFile.csv'
csv_response = urlfetch.fetch(csv_string, allow_truncated=True)
if csv_response.status_code == 200:
for row in csv_response.content.split('\n'):
row_values = row.split(',')
# csv values are strings. Cast them if they need to be something else
new_entry = MyModel(
property1 = row_values[0],
property2 = row_values[1]
)
new_entry.put()
else:
print 'cannot load file: {}'.format(csv_string)

NIFI: get value from json

i have a queryCassandra which generate json like this one:
{"results":[{"term":"term1"},{"term":"term2"}..]}
Now, i want to get from this all the term values separated by some separator in string format; ex :
term1,term2,term3
So i can pass this list as a string parameter for a java main program which i've alreat set.
(i only need the transofrmation, not the java program execution)
Thank you !
You can easily get those values by using following ways.
GetFile-->EvaluateJsonPath-->PutFile
In get file you have to specify location of json file.
In EvaluateJsonPath configure like following properties.,
Destination:flowfile-attribute
Return Type:json
input.term1:$.results.[0].term //To get term
input.term2:$.results.[1].term
At the result of Evaluate json you have two attributes in which having those values.
Result attributes:
input.term1: term1
input.term2: term2
Above code works for me,so feel free to upvote/accept as answer.
as a variant use ExecuteScript with groovy lang:
import groovy.json.*
//get input file
def flowFile = session.get()
if(!flowFile)return
//parse json to map/array objects
def content = session.read(flowFile).withStream{ stream-> return new JsonSlurper().parse( stream ) }
//transform
content = content.results.collect{ it.term }.join(',')
//write new content
flowFile = session.write(flowFile,{ stream->
stream << content.getBytes("UTF-8")
} as OutputStreamCallback)
session.transfer(flowFile, REL_SUCCESS)

Python3: Loop over objects and add attributes to an array or object

I'm trying to loop through some objects and add certain attributes to an array to be sent back as JSON to the view:
data = {}
camera_logs = CameraLog.objects.filter(camera_id=camera_id)
for log in camera_logs :
setattr(data, 'celsius', log.celsius)
setattr(data, 'fahrenheit', log.fahrenheit)
return JsonResponse(data)
I'm quite new to Python, so I'm not sure if I'm even on the right track.
Accessing Python dictionaries is much easier than using setattr.
You have two options. Either create a dictionary with IDs as Keys or just a simple list.
import json
data = {}
camera_logs = CameraLog.objects.filter(camera_id=camera_id)
for log in camera_logs :
data[log.id] = log.celsius
return Response(json.dumps(data))
in above solution you will have a dictionary with ID of CameraLog as a KEY, and as value you will have celsius. Basically your json will look like this:
{
1: 20,
2: 19,
3: 21
}
Second approach is to send a simple list of values, but I guess you would like to have info, what camera had what temp
import json
data = []
camera_logs = CameraLog.objects.filter(camera_id=camera_id)
for log in camera_logs :
data.append(log.celsius)
return Response(json.dumps(data))
Edit to an answer
If you wish to have a list of dicts, make something like this:
import json
data = []
camera_logs = CameraLog.objects.filter(camera_id=camera_id)
for log in camera_logs :
data.append({
'camera_id': log.id,
'celsius': log.celsius,
'fahrenheit': log.fahrenheit
})
return Response(json.dumps(data))
You can enhance your query by only selecting the attributes that you need from a queryset using .values_list.
import json
camera_logs = CameraLog.objects.filter(camera_id=camera_id).values_list(
'celsius', 'fahrenheit')
data = [{"celcius": cel, "fahrenheit": fahr} for cel, fahr in camera_logs]
return Response(json.dumps(data))