Overwriting Documents in DocumentDB - json

I have an Event Hub / Stream Analytics / DocumentDB chain. The Stream Analytics job takes a JSON object and persists it in a DocumentDB collection by id.
If a document with the same id already exists in the collection, it should get overwritten, right? But this is not happening.
Let's say I have this object in the collection:
{
"id" : "001",
"array":[
{
"key1" : "value1"
},
{
"key2" : "value2"
},
{
"key3" : "value3"
}
]
}
And the new document persisted by the stream job is:
{
"id" : "001",
"array":[
{
"key4" : "value4"
},
{
"key5" : "value5"
}
]
}
The new document that I get in the collection looks like this:
{
"id" : "001",
"array":[
{
"key4" : "value4"
},
{
"key5" : "value5"
},
{
"key3" : "value3"
}
]
}
The array doesn't get overwritten; only the elements up to the size of the new document being saved are replaced. If oldArray.size > newArray.size, some of the old data will still be there.
I want to prevent this: I want to overwrite the whole document and get rid of all the old data.
Is there a way to do that?
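For what it's worth, one way to force a full replace outside of Stream Analytics is to write the document yourself through the DocumentDB SDK, since an upsert replaces the stored document wholesale rather than merging fields. This is only a minimal sketch with the Node.js documentdb client; the endpoint, key, and collection link are placeholders, and whether you can hook this into your pipeline depends on your setup.

var DocumentClient = require('documentdb').DocumentClient;

// Placeholder endpoint and key.
var client = new DocumentClient('https://<account>.documents.azure.com:443/', {
    masterKey: '<key>'
});

// Placeholder database/collection link.
var collectionLink = 'dbs/mydb/colls/mycollection';

// The replacement document from the example above.
var newDocument = {
    id: '001',
    array: [
        { key4: 'value4' },
        { key5: 'value5' }
    ]
};

// Upsert replaces the whole stored document with the same id,
// so nothing from the old "array" survives.
client.upsertDocument(collectionLink, newDocument, function (err, upserted) {
    if (err) throw err;
    console.log('replaced document ' + upserted.id);
});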

Related

How to use JSON.parse with bucket_script?

I have a field saved as a JSON string and I need to compute the average value of price in "{price: 10}". How do I use JSON.parse with bucket_script to compute this in Elasticsearch?
There is no JSON parsing class in Painless, so you cannot do this at query time. You should parse the JSON at index time instead; this will also make your search queries faster. Once the price is indexed as a field, a plain avg aggregation can compute the average (see the sketch after the Logstash example).
1. Ingest
You can use the JSON processor:
{
"json" : {
"field" : "string_source",
"target_field" : "json_target"
}
}
Pipeline
PUT _ingest/pipeline/my-pipeline
{
"description": "describe pipeline",
"processors": [
{
"json": {
"field": "string_source",
"target_field": "json_target"
}
}
]
}
Index document using ingest pipeline
POST json_index/_doc?pipeline=my-pipeline
{
"string_source":"{\"price\":10}"
}
Document
"hits" : [
{
"_index" : "json_index",
"_type" : "_doc",
"_id" : "m7t3gXEB1B5aJp__0oos",
"_score" : 1.0,
"_source" : {
"json_target" : {
"price" : 10
},
"string_source" : """{"price":10}"""
}
}
]
If you don't want to keep the original string in the index, you can add a remove processor:
PUT _ingest/pipeline/my-pipeline
{
"description": "describe pipeline",
"processors": [
{
"json": {
"field": "string_source",
"target_field": "json_target"
}
},
{
"remove": {
"field": "string_source"
}
}
]
}
2. Logstash
This is a JSON parsing filter. It takes an existing field which contains JSON and expands it into an actual data structure within the Logstash event.
filter {
json {
source => "message"
}
}
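To come back to the original question about the average price: once the JSON has been parsed at index time as above, no bucket_script is needed; a plain avg aggregation over the parsed field should do it, assuming json_target.price ends up mapped as a numeric type. A rough sketch:

POST json_index/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "json_target.price"
      }
    }
  }
}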

Method to assign object IDs to imported JSON in Firebase

Firebase organizes an imported JSON file by keying each element of the "features" array with its numeric position in the array. But the imported file (and the file exported back from Firebase) is organized this way:
{
"features" : [ {
"geometry" : {
"coordinates" : [ -77.347191, 36.269321 ],
"type" : "Point"
},
"properties" : {
"name" : "Branch Chapel",
"osm_id" : "262661",
"religion" : "christian"
},
"type" : "Feature"
},
...
It appears that Firebase assigns an internal number to each object in the array of "features". This is nice, but it makes it hard to reference each object without knowing how Firebase is naming it, and I have 400k+ objects.
Is there a way to assign an id to each object to prevent Firebase from generating its own? Or is there a way to programmatically rename/reorganize the data after it has been imported? The optimal outcome would have each object named by its osm_id rather than some arbitrary number Firebase assigns.
Any help is appreciated.
Get rid of the square brackets and replace them with curly brackets.
This:
{
"flags": {
"1": {
"information": "blah"
},
"2": {
"information": "It is great!"
},
"3": {
"information": "Amazing!"
}
}
}
Not this:
[
{
"1": {
"information": "blah"
}
},
{
"2": {
"information": "It is great!"
}
},
{
"3": {
"information": "Amazing!"
}
}
]
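For the programmatic route, a small Node script can re-key the array by osm_id before importing, so Firebase never sees an array at all. This is just a sketch; the input and output file names are made up, and it assumes every feature has a unique properties.osm_id.

const fs = require('fs');

// Hypothetical file names for the source and re-keyed data.
const input = JSON.parse(fs.readFileSync('features.json', 'utf8'));

const featuresByOsmId = {};
for (const feature of input.features) {
    // Key each object by its osm_id instead of its array index.
    featuresByOsmId[feature.properties.osm_id] = feature;
}

fs.writeFileSync(
    'features-keyed.json',
    JSON.stringify({ features: featuresByOsmId }, null, 2)
);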

JSON-LD context with array of objects

I am trying to define a JSON-LD context that includes an array of objects.
Does anyone know why the output is empty?
{
"#context": {
"testobjects": {
"#id" : "http://example.org/arrayOfObjects",
"type" : "array",
"items" : {
"type" : "object",
"properties" : {
"attr1": { "type" : "number", "default" : 1},
"attr2": { "type" : "string", "default" : "foo"}
}
}
}
},
"testobjects": [
{
"attr1": 216,
"attr2": "test"
},
{
"attr1": 329,
"attr2": "test2"
}
]
}
Output:
[
{
"http://example.org/arrayOfObjects": [
{},
{}
]
}
]
See the JSON-LD Playground to try it yourself.
Note that data does not go in a context, so having items as part of the "testobjects" term definition won't do. There is a proposal to allow multi-dimensional arrays to have individual types, which may be taken up in the future.
For the example you've provided to generate anything, both "attr1" and "attr2" would need to be defined as terms at the top level of the context, not under an existing term definition; a corrected context along those lines is sketched below.
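As a rough illustration of that point, a context that maps the two attributes as top-level terms does produce output on expansion; the IRIs for attr1 and attr2 below are invented purely for the example.

{
  "@context": {
    "testobjects": "http://example.org/arrayOfObjects",
    "attr1": "http://example.org/attr1",
    "attr2": "http://example.org/attr2"
  },
  "testobjects": [
    {
      "attr1": 216,
      "attr2": "test"
    },
    {
      "attr1": 329,
      "attr2": "test2"
    }
  ]
}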

How to fetch the data in MongoDB

How do I fetch data from the JSON file using the mongo shell?
I want to fetch the data by PolicyID.
In the JSON file I sent, the PolicyID is 3148.
I tried a couple of ways to write the command, but both return 0 documents:
db.GeneralLiability.find({"properties.id":"21281"})
db.GeneralLiability.find({properties:{_id:"21281"}})
Do I need to set anything else? Indexes, cursors, etc.?
Sample JSON:
{
"session": {
"data": {
"account": {
"properties": {
"userName": "abc.com",
"_dateModified": "2014-10-01",
"_manuscript": "Carrier_New_Rules_2_1_0",
"_engineVersion": "2.0.0",
"_cultureCode": "en-US",
"_cultureName": "United States [english]",
"_context": "Underwriter",
"_caption": "Carrier New Rules (2.1.0)",
"_id": "p1CEB08012E51477C9CD0E89FE77F5E51"
},
"properties": {
"_xmlns:xsd": "http://www.w3.org/2001/XMLSchema",
"_xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
"_id": "3148",
"_HistoryID": "5922",
"_Type": "onset",
"_Datestamp": "2014-10-01T04:46:33",
"_TransactionType": "New",
"_EffectiveDate": "2014-01-01",
"_Charge": "1599",
"_TransactionGroup": "t4CE4FA751F9C400D9007E692A883DA66",
"_PolicyID": "3148",
"_Index": "1",
"_Count": "1",
"_Sequence": "1"
}
}
}
}
}
This will return the document with _PolicyID = "3148" (note that the full dotted path down to the nested properties is required):
db.GeneralLiability.find({
"session.data.account.properties._PolicyID": "3148"
}).pretty();
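No index or cursor settings are required for the find() above to match; an index only speeds the query up. If the collection is large, you could index the nested field, for example (the path is taken from the sample document):

// Optional: index the nested PolicyID field used by the query above.
db.GeneralLiability.createIndex({ "session.data.account.properties._PolicyID": 1 })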
You have some issues in your document formatting. First off, I am pretty sure that field names starting with underscores are reserved for Mongo (I could be wrong). Either way, it is bad form. I have restructured your data for you. I am not sure why you wanted to nest your data so much, but I am guessing you had a good reason for it.
You will notice that I am using the ObjectId from Mongo for my _id:
{
"_id" : ObjectId("56e1c1f53bac31a328e3682b"),
"session" : {
"data" : {
"account" : {
"properties" : {
"xmlns:xsd" : "http://www.w3.org/2001/XMLSchema",
"xmlns:xsi" : "http://www.w3.org/2001/XMLSchema-instance",
"HistoryID" : "5922",
"Type" : "onset",
"Datestamp" : "2014-10-01T04:46:33",
"TransactionType" : "New",
"EffectiveDate" : "2014-01-01",
"Charge" : "1599",
"TransactionGroup" : "t4CE4FA751F9C400D9007E692A883DA66",
"PolicyID" : "3148",
"Index" : "1",
"Count" : "1",
"Sequence" : "1"
}
}
}
}
}
Now if you run this command it will return your document:
db.GeneralLiability.find({ "session.data.account.properties.PolicyID": "3148" })

Sub-records in Avro with Morphlines

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema as displayed below (subrec.avsc). As you can see, the schema consists of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc, it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**"]
commands : [
# Read the JSON blob
{ readJson: {} }
{ logError { format : "record: {}", args : ["#{}"] } }
# Extract JSON
{ extractJsonPaths { flatten: false, paths: {
"/record_type[]/alert/action" : /alert/action,
"/record_type[]/alert/signature_id" : /alert/signature_id,
"/record_type[]/alert/signature" : /alert/signature,
"/record_type[]/alert/category" : /alert/category,
"/record_type[]/alert/severity" : /alert/severity
} } }
{ logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } }
{ extractJsonPaths { flatten: false, paths: {
timestamp : /timestamp,
event_type : /event_type,
source_ip : /src_ip,
source_port : /src_port,
destination_ip : /dest_ip,
destination_port : /dest_port,
protocol : /proto,
} } }
# Create Avro according to schema
{ logError { format : "WE GO TO AVRO"} }
{ toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }
# Create Avro container
{ logError { format : "WE GO TO BINARY"} }
{ writeAvroToByteArray { format: containerlessBinary } }
{ logError { format : "DONE!!!"} }
]
}
]
And the subrec.avsc
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "event_type",
"type" : "string"
}, {
"name" : "source_ip",
"type" : "string"
}, {
"name" : "source_port",
"type" : "int"
}, {
"name" : "destination_ip",
"type" : "string"
}, {
"name" : "destination_port",
"type" : "int"
}, {
"name" : "protocol",
"type" : "string"
}, {
"name": "record_type",
"type" : ["null", {
"name" : "alert",
"type" : "record",
"fields" : [ {
"name" : "action",
"type" : "string"
}, {
"name" : "signature_id",
"type" : "int"
}, {
"name" : "signature",
"type" : "string"
}, {
"name" : "category",
"type" : "string"
}, {
"name" : "severity",
"type" : "int"
}
] } ]
} ]
}
The output of { logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } } is the following:
[{
/record_type[]/alert/action = [allowed],
/record_type[]/alert/category = [],
/record_type[]/alert/severity = [3],
/record_type[]/alert/signature = [GeoIP from NL, Netherlands],
/record_type[]/alert/signature_id = [88006],
_attachment_body = [{
"timestamp": "2015-03-23T07:42:01.303046",
"event_type": "alert",
"src_ip": "1.1.1.1",
"src_port": 18192,
"dest_ip": "46.231.41.166",
"dest_port": 62004,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": "1",
"signature_id": "88006",
"rev": "1",
"signature" : "GeoIP from NL, Netherlands ",
"category" : ""
"severity" : "3"
}
}],
_attachment_mimetype=[json/java + memory],
basename = [simple_eve.json]
}]
UPDATE 2017-06-22
You MUST populate the data in the structure in order for this to work, by using addValues or setValues:
{
addValues {
micDefaultHeader : [
{
eventTimestampString : "2017-06-22 18:18:36"
}
]
}
}
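If you would rather overwrite whatever is already in the field instead of appending to it, the setValues form looks the same; as far as I know it simply replaces any existing values for the field.

{
  setValues {
    # Same structure as addValues above; setValues replaces any existing
    # values for micDefaultHeader instead of appending to them.
    micDefaultHeader : [
      {
        eventTimestampString : "2017-06-22 18:18:36"
      }
    ]
  }
}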
After debugging the sources of the morphline toAvro command, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
The solution is quite simple, but unfortunately it took a little extra time: Eclipse, running the Flume agent in debug mode, cloning the source code, and lots of coffee.
Here it goes.
My schema:
{
"type" : "record",
"name" : "co_lowbalance_event",
"namespace" : "co.tigo.billing.cboss.lowBalance",
"fields" : [ {
"name" : "dummyValue",
"type" : "string",
"default" : "dummy"
}, {
"name" : "micDefaultHeader",
"type" : {
"type" : "record",
"name" : "mic_default_header_v_1_0",
"namespace" : "com.millicom.schemas.root.struct",
"doc" : "standard millicom header definition",
"fields" : [ {
"name" : "eventTimestampString",
"type" : "string",
"default" : "12345678910"
} ]
}
} ]
}
The morphlines file:
morphlines : [
{
id : convertJsonToAvro
importCommands : ["org.kitesdk.**"]
commands : [
{
readJson {
outputClass : java.util.Map
}
}
{
addValues {
micDefaultHeader : [{}]
}
}
{
logDebug { format : "my record: {}", args : ["#{}"] }
}
{
toAvro {
schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
}
}
{
writeAvroToByteArray {
format : containerlessJSON
codec : null
}
}
]
}
]
The magic lies here:
{
addValues {
micDefaultHeader : [{}]
}
}
And in the mappings:
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
Explanation:
Inside the code, the first field name that is evaluated is micDefaultHeader, of type RECORD. As there is no way to specify a default value for a RECORD (logically correct), the toAvro code evaluates it, does not find any value configured in the mappings, and therefore fails, because it (wrongly) detects that the record is empty when it shouldn't be.
However, taking a look at the code, you can see that it only needs a Map object containing no values to satisfy the parser and continue to the next element.
So we add a map object using addValues and fill it with an empty map: [{}]. Note that this must match the name of the record that is causing the empty value; in my case, "micDefaultHeader".
Feel free to comment if you have a better solution, as this looks like a "dirty fix".