NiFi JsonRecordSetWriter skipping a nested field - json

I am pulling messages out of a Kafka topic into NiFi, and am seeing problems with the JsonRecordSetWriter not outputting a nested field. If I switch from that writer to a CSV writer I don't have the same problem, but both the Avro and XML writers have the same problem, so it's a problem with the inferred schema, I believe.
Here is my simplified input:
{
"data": [
{
"object": {
"extensions": {
"field1": "TS",
"field2": "howdy"
}
}
},
{
"object": {
"extensions": {
"field1": "TT",
"field3": "something"
}
}
}
]
}
And the output:
[ {
"data" : [ {
"object" : {
"extensions" : {
"field1" : "TS",
"field2" : "howdy"
}
}
}, {
"object" : {
"extensions" : {
"field1" : "TT",
"field2" : null
}
}
} ]
} ]
If I use the CSV writer I get field1 and field2 for the first record and field1 and field3 for the second record, so the JsonTreeReader is reading the data from Kafka correctly; it's the JsonRecordSetWriter that isn't writing it out right. It looks like the schema inference engine is reading the first record in the array for its schema, and then outputting based on that. field2 is output even though it doesn't exist in the second record, and field3 is ignored since it didn't exist in the first record.
Any suggestions from folks who know more than I?
Thanks in advance for any assistance!

I solved my problem, but didn't figure out why it was happening, unfortunately. The way I solved it was to switch from the ConsumeKafkaRecord processor to the ConsumeKafka processor, which doesn't use the JsonTreeReader and JsonRecordSetWriter controllers to process Json. Since the values I was pulling out of Kafka were already in Json, getting them as a string and going from there (adding an application/json mime.type to ensure they were correctly treated) worked just fine for me.
The problem could also have been solved by creating a schema which contained every field possible, but that would have resulted in a lot of null fields, as the records are sparsely populated - and it was further complicated by the fact that one of my fields started with a symbol, and NiFi uses Avro-formatted schemas everywhere, so I would have had to work around that (a bug fix in NiFi 1.7 allows it, but it has limitations also).
So I'm on my way - not sure if this experience will help others, but if it does, great!
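To make the "schema which contained every field possible" option a bit more concrete, here is a minimal Python sketch, not NiFi code. It collects the union of field names seen across all records and builds an Avro-style record schema in which every field is nullable. The helper names (union_fields, avro_record_schema) and the choice of a nullable string type for every field are assumptions made for illustration only.

```python
import json


def union_fields(records):
    """Collect every key seen in any record, keeping first-seen order."""
    seen = {}
    for record in records:
        for key in record:
            seen.setdefault(key, None)
    return list(seen)


def avro_record_schema(name, field_names):
    """Build an Avro-style record schema where every field may be null.

    Assumption: all values are strings, as in the simplified input above.
    """
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": field, "type": ["null", "string"], "default": None}
            for field in field_names
        ],
    }


# The two "extensions" objects from the simplified input in the question.
extensions = [
    {"field1": "TS", "field2": "howdy"},
    {"field1": "TT", "field3": "something"},
]

fields = union_fields(extensions)
schema = avro_record_schema("extensions", fields)
print(json.dumps(schema, indent=2))
```

A schema built from the first record alone would only know field1 and field2, which matches the output shown in the question; the union also picks up field3.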

I had almost the same problem with a transformRecord step on ordinary JSON. In short, the flow listed application event files in S3 and transformed the JSON of those events into 2 types:
1 - For an athena table.
2 - For an HTTP call.
Each event file contained more than 200 events, and somewhere among them was an event with a nested object called "traits"; the information from traits never came through. (This kind of event is rare in the set, about 1 to 2% of the events in the file, but it is extremely important.)
I had to modify the beginning of my flow to make it work.
The flow was:
ListS3 -> FetchS3Object -> JoltTransformRecord -> SplitJson -> EvaluateJsonPath
And became:
ListS3 -> FetchS3Object -> ConvertRecord -> SplitJson -> JoltTransformJson -> EvaluateJsonPath
And now it works perfectly.
@Chrick's solution helped me a lot to identify what the error was and, as my problem is a little different, I hope this can help someone.

Related

JSON Deserialization on Talend

Trying to figure out how to deserialize this kind of JSON in Talend components:
{
"ryan#toofr.com": {
"confidence":119,"email":"ryan#toofr.com","default":20
},
"rbuckley#toofr.com": {
"confidence":20,"email":"rbuckley#toofr.com","default":15
},
"ryan.buckley#toofr.com": {
"confidence":18,"email":"ryan.buckley#toofr.com","default":16
},
"ryanbuckley#toofr.com": {
"confidence":17,"email":"ryanbuckley#toofr.com","default":17
},
"ryan_buckley#toofr.com": {
"confidence":16,"email":"ryan_buckley#toofr.com","default":18
},
"ryan-buckley#toofr.com": {
"confidence":15,"email":"ryan-buckley#toofr.com","default":19
},
"ryanb#toofr.com": {
"confidence":14,"email":"ryanb#toofr.com","default":14
},
"buckley#toofr.com": {
"confidence":13,"email":"buckley#toofr.com","default":13
}
}
This JSON comes from the Toofr API, where documentation can be found here.
Here is the actual situation:
For each line retrieved from the database, I call the API and get this (the first name, the last name and the company change every time).
Does anyone know how to modify the tExtractJSONFields (or use something else) to show the results in tLogRow (for each line in the database)?
Thank you in advance!
EDIT 1:
Here's my tExtractJSONFields:
When using tExtractJSONFields with XPath, you need
1) a valid XPath loop point
2) valid XPath mapping to your structure relative to the loop path
Also, when using XPath with Talend, every value needs a key. The key cannot change if you want to loop over it. Meaning this is invalid:
{
"ryan#toofr.com": {
"confidence":119,"email":"ryan#toofr.com","default":20
},
"rbuckley#toofr.com": {
"confidence":20,"email":"rbuckley#toofr.com","default":15
},
but this structure would be valid:
{
"contact": {
"confidence":119,"email":"ryan#toofr.com","default":20
},
"contact": {
"confidence":20,"email":"rbuckley#toofr.com","default":15
},
So with the correct data the loop point might be /contact.
Then the mapping for Confidence would be confidence (the name from the JSON), the mapping for Email would be email, and likewise for default.
EDIT
JSONPath has a few disadvantages, one of them being you cannot go higher up in the hierarchy. You can try finding out the correct query with jsonpath.com
The loop expression could be $.*. I am not sure if that will satisfy your need, though - it has been a while since I've been using JSONPath in Talend because of the downsides.
I have been ingesting some complex json structures and did this via minimal json libraries, and tjava components within talend.
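To illustrate the idea of handling the changing keys in code rather than with a fixed loop path, here is a small Python sketch. It is not Talend code (inside a tJava component you would do the equivalent in Java with a JSON library); it only shows the logic of looping over the values of the top-level object, which is what a $.* loop expression expresses. The function name extract_rows and the shortened two-entry sample are assumptions for illustration.

```python
import json

# Shortened sample in the same shape as the API response in the question.
raw = """
{
  "ryan#toofr.com": {"confidence": 119, "email": "ryan#toofr.com", "default": 20},
  "rbuckley#toofr.com": {"confidence": 20, "email": "rbuckley#toofr.com", "default": 15}
}
"""


def extract_rows(document):
    """Return one (email, confidence, default) row per top-level entry.

    The top-level keys change with every call, so they are ignored and
    only the values are read.
    """
    return [
        (entry["email"], entry["confidence"], entry["default"])
        for entry in document.values()
    ]


rows = extract_rows(json.loads(raw))
for row in rows:
    print(row)
```

Each row then corresponds to one output line, in the way tLogRow would print one line per loop iteration.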

getDegree()/isOutgoing() functions don't work in graphAware/neo4j-to-elasticsearch mapping.json

Neo4j Version: 3.2.2
Operating System: Ubuntu 16.04
I use the getDegree() function in the mapping.json file, but the returned value is always null. I'm using the Movie/Actor dataset from the neo4j tutorial.
Output from elasticsearch request
mapping.json
{
"defaults": {
"key_property": "uuid",
"nodes_index": "default-index-node",
"relationships_index": "default-index-relationship",
"include_remaining_properties": true
},
"node_mappings": [
{
"condition": "hasLabel('Person')",
"type": "getLabels()",
"properties": {
"getDegree": "getDegree()",
"getDegree(type)": "getDegree('ACTED_IN')",
"getDegree(direction)": "getGegree('OUTGOING')",
"getDegree('type', 'direction')": "getDegree('ACTED_IN', 'OUTGOING')",
"getDegree-degree": "degree"
}
}
],
"relationship_mappings": [
{
"condition": "allRelationships()",
"type": "type"
}
]
}
Also, if I use the isOutgoing(), isIncoming() or otherNode functions in the properties part of relationship_mappings, elasticsearch never loads the relationship data from neo4j. I think I have probably misunderstood the sentence "only when one of the participating nodes "looking" at the relationship is provided" on this page: https://github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
mapping.json
{
"defaults": {
"key_property": "uuid",
"nodes_index": "default-index-node",
"relationships_index": "default-index-relationship",
"include_remaining_properties": true
},
"node_mappings": [
{
"condition": "allNodes()",
"type": "getLabels()"
}
],
"relationship_mappings": [
{
"condition": "allRelationships()",
"type": "type",
"properties": {
"isOutgoing": "isOutgoing()",
"isIncomming": "isIncomming()",
"otherNode": "otherNode"
}
}
]
}
BTW, is there any page that lists all of the functions that we can use in mapping.json? I know two of them:
github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
github.com/graphaware/neo4j-to-elasticsearch/blob/master/docs/json-mapper.md
but it seems there are more, since I can use getType(), which hasn't been listed in any of the above pages.
Please let me know if I can provide any further help to solve the problem
Thanks!
The getDegree() function is not available, unlike getType(). I will explain why:
When the mapper (the part responsible for creating a node or relationship representation as an ES document) is doing its job, it receives a DetachedGraphObject, which is a detached node or relationship.
Detached means that this happens outside of a transaction, so query operations against the database are no longer available. getType() is available because it is part of the relationship metadata and it is cheap; doing the same for getDegree(), however, could be considerably more costly during the DetachedObject creation (which happens in a transaction), depending on the number of different types, etc.
This is however something we are working on, by externalising the mapper into a standalone Java application coupled with a broker such as Kafka, RabbitMQ, etc. between Neo4j and this app. We will not, however, offer the possibility to requery the graph in the current version of the module, as it can have serious performance impacts if the user is not very careful.
Lastly, the only suggestion I can give you is to keep a property on your node that holds the degrees you need to replicate to ES, and to keep it updated.
UPDATE
Regarding this part of the documentation :
For Relationships only when one of the participating nodes "looking" at the relationship is provided:
This applies only when you are not using the JSON definition, so you can use one or the other. The JSON definition was added later, and the two cannot be used together.
To answer this part: it means that the nodes on the incoming or outgoing side, depending on the definition, should be included in the inclusion policy for nodes, such as hasLabel('Employee') || hasProperty('form') || getProperty('age', 0) > 20. If you have an allNodes policy then it is fine.

Node.js SOAP client parameter formatting

I'm having trouble properly formatting one particular soap parameter using the node-soap module for node.js as a client, to a 3rd-party SOAP service.
The client.describe() for this method says this particular input should be in the shape of:
params: { 'param[]': {} }
I have tried a bunch of different JSON notations to try to fit my data to that shape.
Examples of formats that do NOT work:
"params": { "param": [ {"myParameterName": "myParameterValue"} ] }
"params": [ "param": { "name": "myParameterName", "_": "myParameterValue"} ]
"params": { "param" : [ {"name": "myParameterName", "_": "myParameterValue"} ] }
"params": { "param[]": {"myParameterName": "myParameterValue" } }
"params": { "param[myParameterName]": {"_": "myParameterValue" } }
I must be overlooking something, and I suspect I'm going to feel like Captain Obvious when some nice person points out what I'm doing wrong.
Here is what DOES work, using other soap clients, and how they handle the "named parameter with a value"
soapUI for this method successfully accepts this particular input via XML in the shape of:
<ns:params>
<ns:param name="myParameterName">myParameterValue</ns:param>
</ns:params>
Also, using PHP, I can successfully make the call by creating a stdClass of arrays like so:
$parms = new stdClass;
$parms->param = array(
array(
"name"=>"myParameterName","_"=>"myParameterValue"
)
);
and then eventually passing
'params' => $parms
to the PHP soap client
Many thanks!
To get a better look at what XML was being generated by node-soap, I added a console.log(message) statement to the node_modules/soap/lib/client.js after the object-to-XML encoding. I then began experimenting with various JSON structures to figure out empirically how they were mapping to XML structures.
I found a JSON structure for node-soap to generate the XML in my 3rd-party's required named-parameter-with-value format. I was completely unaware of the "$value" special keyword. Looks like this may have been added in the 0.4.6 release from mid-June 2014. See the change history
"params": [
{
"param": {
"attributes": {
"name": "myParameterName"
},
$value: "myParameterValue"
}
}
]
(note the outer array, which gives me the luxury of specifying multiple "param" entries, which is sometimes needed by this particular 3rd-party API)
generates this XML:
<tns:params>
<tns:param name="myParameterName">myParameterValue</tns:param>
</tns:params>
which perfectly matches the structure in soapUI (which I already knew worked) of:
<ns:params>
<ns:param name="myParameterName">myParameterValue</ns:param>
</ns:params>
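For readers who want to see the mapping spelled out, here is a small Python sketch of how a structure with "attributes" and "$value" corresponds to an XML element with an attribute and text content. This is not node-soap and does not reproduce its encoder; it only mimics the shape of the mapping with the standard library, it leaves out the namespace prefix, and the function name build_params is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def build_params(entries):
    """Turn a list like [{"param": {"attributes": {...}, "$value": "..."}}]
    into a <params> element with one child element per entry."""
    params = ET.Element("params")
    for entry in entries:
        for tag, body in entry.items():
            # "attributes" becomes XML attributes, "$value" becomes the text.
            child = ET.SubElement(params, tag, body.get("attributes", {}))
            child.text = body.get("$value")
    return params


payload = [
    {
        "param": {
            "attributes": {"name": "myParameterName"},
            "$value": "myParameterValue",
        }
    }
]

xml_text = ET.tostring(build_params(payload), encoding="unicode")
print(xml_text)
```

Adding more dictionaries to the outer list produces additional param elements, which is the reason the outer array is useful.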

sails.js mongodb rest api update issue

I am new to sails.js and mongodb, and I found something strange. When I use the REST API to update a record in mongodb, the JSON format changes after the record is updated.
For example, originally I have a record like this:
{
creator: "John",
taskname: "test",
id: "53281a5d709602dc17b000cd"
}
After clicking http://127.xxx:1337/testtask/update/53281a5d709602dc17b000cd?creator=default%20creator, the following JSON was returned.
The JSON fields are sorted in alphabetical order.
How can I keep the original format of the JSON? Is it a bug? Is there any workaround?
{
createdAt: "2014-03-18T10:05:17.052Z",
creator: "default creator",
taskname: "test",
updatedAt: "2014-03-18T10:08:53.067Z",
id: "53281a5d709602dc17b000cd"
}
Thanks.
The problem is fields in JSON objects don't have any concept of order. A JSON object is a dictionary, or in other words just some key/value pairs. This means that this JSON:
{ "a" : "some string", "b" : "other string" }
is logically equivalent to this JSON:
{ "b" : "other string", "a" : "some string" }
If you want to preserve ordering in your JSON data there are other ways to do it. For example JSON arrays do preserve order so something like this would work:
[ { "a" : "some string" }, { "b" : "other string" } ]
Internally MongoDB may actually preserve the ordering, but that's an implementation detail and you can't depend on it.
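The point about ordering can be checked quickly. The following Python sketch parses the snippets above and compares them: the two objects compare equal regardless of key order, while the array form is order-sensitive. The variable names are only for illustration.

```python
import json

# The two objects from above, differing only in key order.
first = json.loads('{ "a" : "some string", "b" : "other string" }')
second = json.loads('{ "b" : "other string", "a" : "some string" }')

# Objects are key/value mappings, so key order does not affect equality.
objects_equal = first == second

# Arrays are ordered sequences, so reversing the elements changes the value.
ordered = json.loads('[ { "a" : "some string" }, { "b" : "other string" } ]')
reordered = list(reversed(ordered))
arrays_equal = ordered == reordered

print(objects_equal, arrays_equal)
```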
More detail on what Mongo is doing here.
Much like the "other" framework that inspired this, there is some automatic time-stamp generation happening in your models when things are updated or created. You wouldn't be the first. Ruby people have been trapped by this for years.
There are options you can define on your collection objects to remove these fields. This comes from the Waterline documentation, which should be the manager in use.
So in addition to attributes:
autoCreatedAt: false,
autoUpdatedAt: false,
attributes: {
// normal things here
},
Of course you will need to remove any of these properties that have already been created in your documents manually. See the $unset operator for doing this.

Solr Multivalued Problem

The following is the JSON response I'm getting from Solr if I use multivalued = true for the fields.
{
"id":["1","2","3"],
"TS":["2010-06-28 00:00:00.0","2010-06-28 00:00:00.0","2010-06-28 00:00:00.0"],
"Type":["VIDEO","IMAGE","VIDEO"]
}
but I need the response like this
{
"0":["1","2010-06-28 00:00:00.0","VIDEO"],
"1":["2","2010-06-28 00:00:00.0","IMAGE"],
"2":["3","2010-06-28 00:00:00.0","VIDEO"]
}
How can I get this? Any help would be appreciated. Thanks in advance.
**Update :**
Actually, at the first level it's not a problem. The problem arises only when we go more than one level deep. I'm now putting the entire response here to make it clear.
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"facet":"true",
"indent":"on",
"start":"0",
"q":"laptop",
"wt":["json",
"json"],
"rows":"200"}},
"response":{"numFound":1,"start":0,"docs":[
{
"createdBy":"0",
"id":194,
"status":"ACTIVE",
"text":"Can i buy Sony laptop?",
"ansTS":["2010-07-01 00:00:00.0","2010-08-06 15:11:55.0","2010-08-11 15:28:13.0","2010-08-11 15:30:49.0","2010-08-12 01:45:48.0","2010-08-12 01:46:18.0"],
"mediaType":["VIDEO","VIDEO","VIDEO"],
"ansId":["59","76","77","78","80","81"],
"mediaId":[24,25,26]
}
]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"catName":[]},
"facet_dates":{}}}
Look at the mediaId, mediaType and ansTS arrays. It's a one-to-many relationship, but the values are grouped by column name. Thanks in advance.
You mentioned that you will consume this JSON from a browser. So you can use jQuery or any other javascript library to convert the raw Solr JSON response into the structure that you need.
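The reshaping itself is a small transposition of parallel arrays. Here is a sketch of that logic in Python (in a browser you would write the equivalent in JavaScript); the function name columns_to_rows is an assumption for illustration, and the sketch assumes the listed arrays all have the same length.

```python
def columns_to_rows(document, field_names):
    """Regroup parallel multivalued fields into one row per position.

    Assumes every listed field holds a list of the same length.
    """
    columns = [document[name] for name in field_names]
    return {
        str(index): list(row)
        for index, row in enumerate(zip(*columns))
    }


# The multivalued response from the question.
solr_doc = {
    "id": ["1", "2", "3"],
    "TS": ["2010-06-28 00:00:00.0", "2010-06-28 00:00:00.0", "2010-06-28 00:00:00.0"],
    "Type": ["VIDEO", "IMAGE", "VIDEO"],
}

rows = columns_to_rows(solr_doc, ["id", "TS", "Type"])
print(rows)
```

For the full response, the same function could be applied to groups of fields with matching lengths, for example mediaId with mediaType.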
If the first snippet is the actual solr response you're getting, then chances are you have a bug in your feeder (connector/crawler/etc). It looks like you only have one indexed document (that matches your query), which has all the values that you expect from 3 documents.
Assuming you have 3 documents, analogous with your expected output, then the actual solr wt=json result would contain:
[{
"id":"1",
"TS":"2010-06-28 00:00:00.0",
"Type":"VIDEO"
},
{
"id":"2",
"TS":"2010-06-28 00:00:00.0",
"Type":"IMAGE"
},
{
"id":"3",
"TS":"2010-06-28 00:00:00.0",
"Type":"VIDEO"
}]
If this assumption is correct, then I would suggest looking over your indexing logic.
This output is produced by Solr's JSONResponseWriter. Its output can't be altered via configuration. But what you can do is create your own version of JSONResponseWriter to produce your desired output. You can register your new ResponseWriter by adding a queryResponseWriter tag in solrconfig.xml.