Conditional Upsert in Couchbase

I have documents that would look like this:
{
"name": "n",
"age": 22
//other properties
"hash": "XyRZHDJJD6738..." //This property contains the hash of the object (calculated by the client)
}
From the client, I need to either:
Update the document using its key (known), ONLY if the hash is different (=> The stored object and the new object are not the same)
Insert the document if the Key doesn't exist
This operation is done in bulk mode on a relatively large dataset, with concurrent access, so fetching the document and then updating it is not an option.
Is there a way to do this in Couchbase (5.1+)?

With a tweak to the document model, you could have something like this:
{
"name": "n",
"age": 22,
"applied_hashes": {
"XyRZHDJJD6738": null,
"AB2343DCxdsAd": null,
// ... other hashes
}
}
Now you can do each update as a Sub-Document operation, with the first spec trying to insert the hash of the update into applied_hashes. If that hash/update has previously been applied, this insert will fail, and since Sub-Document operations are atomic, no changes will be made to the document.
With Java SDK 3.x this looks like:
import java.util.Arrays;

import com.couchbase.client.core.error.subdoc.PathExistsException;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.kv.MutateInSpec;

try {
    collection.mutateIn("id",
        Arrays.asList(
            // Fails with PathExistsException if this update's hash was already applied
            MutateInSpec.insert("applied_hashes.XyRZHDJJD6738", null).createPath(),
            MutateInSpec.upsert("age", 24)
            // .. other parts of update XyRZHDJJD6738 here
        ));
}
catch (PathExistsException err) {
    // Update XyRZHDJJD6738 has already been applied
    // No changes have been made to the document
}
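For the other half of the requirement in the question (insert the document if the key doesn't exist), one option is to run the same mutateIn with upsert store semantics so a missing document is created on the fly. This is a sketch, not part of the original answer:

// Sketch: StoreSemantics.UPSERT makes mutateIn create the document when the key
// does not exist yet (requires com.couchbase.client.java.kv.MutateInOptions and
// com.couchbase.client.java.kv.StoreSemantics).
collection.mutateIn("id",
    Arrays.asList(
        MutateInSpec.insert("applied_hashes.XyRZHDJJD6738", null).createPath(),
        MutateInSpec.upsert("age", 24)),
    MutateInOptions.mutateInOptions().storeSemantics(StoreSemantics.UPSERT));

With that, the createPath() on the hash spec also creates the applied_hashes object in a brand-new document.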

Related

ADF - Data Flow - JSON Expression for Property name

I have a requirement to convert JSON into CSV (or a SQL table) or any other flattened structure using Data Flow in Azure Data Factory. I need to take the property names at one level of the hierarchy and the values of the child properties at a lower level from the source JSON, and add them both as column/row values in the CSV or other flattened structure.
Source Data Rules/Constraints :
Parent-level data property names will change dynamically (e.g. the ABCDataPoints, CementUse, CoalUse, ABCUseIndicators names are dynamic)
The hierarchy always remains the same as in the sample JSON below.
I need some help in defining the JSON path/expression to get the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators, etc. I was able to figure out how to retrieve the values for the properties Value, ValueDate, ValueScore, AsReported.
Source Data Structure :
{
"ABCDataPoints": {
"CementUse": {
"Value": null,
"ValueDate": null,
"ValueScore": null,
"AsReported": [],
"Sources": []
},
"CoalUse": {
"Value": null,
"ValueDate": null,
"AsReported": [],
"Sources": []
}
},
"ABCUseIndicators": {
"EnvironmentalControversies": {
"Value": false,
"ValueDate": "2021-03-06T23:22:49.870Z"
},
"RenewableEnergyUseRatio": {
"Value": null,
"ValueDate": null,
"ValueScore": null
}
},
"XYZDataPoints": {
"AccountingControversiesCount": {
"Value": null,
"ValueDate": null,
"AsReported": [],
"Sources": []
},
"AdvanceNotices": {
"Value": null,
"ValueDate": null,
"Sources": []
}
},
"XYXIndicators": {
"AccountingControversies": {
"Value": false,
"ValueDate": "2021-03-06T23:22:49.870Z"
},
"AntiTakeoverDevicesAboveTwo": {
"Value": 4,
"ValueDate": "2021-03-06T23:22:49.870Z",
"ValueScore": "0.8351945854483925"
}
}
}
Expected flattened structure:
Background:
After multiple calls with ADF experts at Microsoft (our workplace has a Microsoft/Azure partnership), they concluded this is not possible with the out-of-the-box activities provided by ADF as is, neither with Data Flow (we don't have to use Data Flow, though) nor with the Flatten feature. The reason is that Data Flow/Flatten only unrolls array objects, and there are no mapping functions available to pick the property names; custom expressions are in internal beta testing and will be in PA in the near future.
Conclusion/Solution:
Based on the calls with Microsoft employees, we agreed to go with multiple approaches, but both need custom code; without custom code this is not possible using out-of-the-box activities.
Solution 1: Use some code to flatten as required, in an ADF Custom Activity. The downside is that you need external compute (VM/Batch) and the supported options are not on-demand, so it is a little expensive, but it works best if you have continuous streaming workloads. With this approach you also have to keep monitoring whether the input sources vary in size, because the compute needs to be elastic in that case or you will get out-of-memory exceptions.
Solution 2: Still needs custom code, but in a Function App.
Create a Copy Activity with the files containing the JSON content as the source (preferably in a storage account).
Use the REST endpoint of the function as the target (not an Azure Function activity, because that has a 90-second timeout when called from an ADF activity).
The Function App takes JSON lines as input, parses them, and flattens them; a sketch of that code follows below.
This way you can scale the number of lines sent in each request to the function, and also scale the number of parallel requests.
The function flattens as required into one or more files and stores them in blob storage.
The pipeline continues from there as needed.
One problem with this approach is that if any of the ranges fails, the Copy Activity will retry, but it runs the whole process again.
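To illustrate what the Function App code in Solution 2 could do, here is a minimal sketch in Java using Jackson (the class name and CSV column layout are hypothetical, not prescribed by ADF). It walks the first two levels of the posted structure, keeping the dynamic parent and child property names as row values next to the fixed leaf properties:

import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FlattenSketch {
    public static void main(String[] args) throws Exception {
        // Trimmed sample; in the Function App this would come from the request payload
        String json = "{\"ABCDataPoints\":{\"CementUse\":{\"Value\":null,\"ValueDate\":null,\"ValueScore\":null,\"AsReported\":[]}}}";
        JsonNode root = new ObjectMapper().readTree(json);

        System.out.println("Category,DataPoint,Value,ValueDate,ValueScore,AsReported");
        Iterator<Map.Entry<String, JsonNode>> categories = root.fields();
        while (categories.hasNext()) {
            Map.Entry<String, JsonNode> category = categories.next();      // e.g. ABCDataPoints
            Iterator<Map.Entry<String, JsonNode>> points = category.getValue().fields();
            while (points.hasNext()) {
                Map.Entry<String, JsonNode> point = points.next();          // e.g. CementUse
                JsonNode leaf = point.getValue();
                // Dynamic property names become row values; missing leaf properties become empty strings
                System.out.println(String.join(",",
                        category.getKey(),
                        point.getKey(),
                        leaf.path("Value").asText(),
                        leaf.path("ValueDate").asText(),
                        leaf.path("ValueScore").asText(),
                        leaf.path("AsReported").toString()));
            }
        }
    }
}

The same loop generalises to reading lines from the request body and writing the rows to blob storage instead of stdout.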
Trying something very similar, is there any other / native solution to address this?
As mentioned in the response above ("custom expressions are in internal beta testing and will be in PA in the near future"), has this been GA'd yet? If yes, any reference documentation / samples would be of great help!

Set variable from activity response Azure Data Factory

I have a REST call in a Copy Data activity which gives me a JSON response.
My goal is to fetch the "hasNextPage" value and put it into the hasNext variable
I want to set it as a value in a "Set variable" activity that is connected to the "Copy data" activity, where I expected to access the output like this: #activity('Timesheets').output.data.timesheets.pageinfo.hasNext
I also want to be able to fetch the value of "cursor" from the last element in the "edges" array[]
I couldn't find any documentation on how to do this
Json response that I get from the Timesheets activity
[
{
"data": {
"timesheets": {
"pageInfo": {
"hasNextPage": true
},
"edges": [
{
"cursor": "81836000243260.81836000243275.",
"node": {
"parameter1": "2019-11-04",
"parameter2": "81836000243260"
}
},
{
"cursor": "81836000243252.81836000243260.81836000243275",
"node": {
"parameter1": "2019-11-04",
"parameter2": "81836000243260"
}
}
]
}
}
}
]
According to this, the output of a Copy Data activity doesn't have a data property you can access:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
The Copy Activity is made for copying large amounts of data, and it doesn't copy all rows in one go.
So it would not make sense to have an output dataset for a Copy Activity.
If the response from your REST service contains a limited number of elements, you can use a Web Activity to consume the REST service.
This has an output you can access.
Follow it with a ForEach activity to iterate over the data set. Remember to take into consideration parallel vs. sequential iteration of your data set in the ForEach activity.
Note that in your service response you get an array of "data" objects, so you need to address the first "data" element.
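For example, the two values could be read with expressions like these (a sketch, assuming the Web Activity is named Timesheets and that its output exposes the response array directly; the exact shape is worth confirming in the debug output, and note the response uses pageInfo.hasNextPage rather than pageinfo.hasNext):

@activity('Timesheets').output[0].data.timesheets.pageInfo.hasNextPage
@last(activity('Timesheets').output[0].data.timesheets.edges).cursor

The first expression would go in the Value field of the Set variable activity for hasNext; the second picks the cursor of the last element of the edges array via the last() function.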

getDegree()/isOutgoing() functions don't work in graphAware/neo4j-to-elasticsearch mapping.json

Neo4j Version: 3.2.2
Operating System: Ubuntu 16.04
I use the getDegree() function in the mapping.json file, but the returned value is always null; I'm using the Neo4j tutorial Movie/Actor dataset.
Output from elasticsearch request
mapping.json
{
"defaults": {
"key_property": "uuid",
"nodes_index": "default-index-node",
"relationships_index": "default-index-relationship",
"include_remaining_properties": true
},
"node_mappings": [
{
"condition": "hasLabel('Person')",
"type": "getLabels()",
"properties": {
"getDegree": "getDegree()",
"getDegree(type)": "getDegree('ACTED_IN')",
"getDegree(direction)": "getGegree('OUTGOING')",
"getDegree('type', 'direction')": "getDegree('ACTED_IN', 'OUTGOING')",
"getDegree-degree": "degree"
}
}
],
"relationship_mappings": [
{
"condition": "allRelationships()",
"type": "type",
}
]
}
Also, if I use the isOutgoing(), isIncoming(), or otherNode functions in the relationship_mappings properties part, Elasticsearch never loads the relationship data from Neo4j. I think I probably have some misunderstanding of the sentence "only when one of the participating nodes 'looking' at the relationship is provided" on this page: https://github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
mapping.json
{
"defaults": {
"key_property": "uuid",
"nodes_index": "default-index-node",
"relationships_index": "default-index-relationship",
"include_remaining_properties": true
},
"node_mappings": [
{
"condition": "allNodes()",
"type": "getLabels()"
}
],
"relationship_mappings": [
{
"condition": "allRelationships()",
"type": "type",
"properties": {
"isOutgoing": "isOutgoing()",
"isIncomming": "isIncomming()",
"otherNode": "otherNode"
}
}
]
}
BTW, is there any page that lists all of the functions that we can use in mapping.json? I know two of them:
github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
github.com/graphaware/neo4j-to-elasticsearch/blob/master/docs/json-mapper.md
but it seems there are more, since I can use getType(), which hasn't been listed in any of the above pages.
Please let me know if I can provide any further information to help solve the problem.
Thanks!
The getDegree() function is not available to use, contrary to getType(). I will explain why:
When the mapper (the part responsible for creating a node or relationship representation as an ES document) is doing its job, it receives a DetachedGraphObject, i.e. a detached node or relationship.
Detached means that this happens outside of a transaction, and thus query operations against the database are no longer available. getType() is available because it is part of the relationship metadata and it is cheap; however, if we wanted to do the same for getDegree(), it could be seriously more costly during the DetachedObject creation (which happens in a tx), depending on the number of different types, etc.
This is, however, something we are working on, by externalising the mapper into a standalone Java application coupled with a broker like Kafka, RabbitMQ, ... between Neo4j and this app. We would not, however, offer the possibility to re-query the graph in the current version of the module, as it can have serious performance impacts if the user is not very careful.
Lastly, the only suggestion I can give you is to keep a property on your node with the updated degrees you need to replicate to ES.
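For example, a maintenance query along these lines could refresh such a property before replication (a Cypher sketch against the Movie dataset, not something the module runs for you; the property name is hypothetical):

MATCH (p:Person)
SET p.actedInDegree = size((p)-[:ACTED_IN]->())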
UPDATE
Regarding this part of the documentation :
For Relationships only when one of the participating nodes "looking" at the relationship is provided:
This is used only when not using the JSON definition, so you can use one or the other; the JSON definition was added later as an addition, and both cannot be used together.
To answer this part: it means that the nodes on the incoming or outgoing side, depending on the definition, should be included in the inclusion policy for nodes, like hasLabel('Employee') || hasProperty('form') || getProperty('age', 0) > 20. If you have an allNodes policy then it is fine.

Couchbase View not returning array value

I am trying to create a view to group on a particular attribute inside an array. However, the below map function is not returning any result.
JSON Document Structure :
{
"jsontype": "survey_response",
"jsoninstance": "xyz",
"jsonlanguage": "en_us",
"jsonuser": "test#test.com",
"jsoncontact": "test#mayoclinic.com",
"pages": [
{
q-placeholder": "q1-p1",
q:localized": "q1-localized-p1",
q-answer-placeholder": "jawaabu121",
q-answer-localized": "localized jawaabu1"
},
{
q-placeholder": "q2-p2",
q:localized": "q2-localized-p2",
q-answer-placeholder": "jawaabu221",
q-answer-localized": "localized jawaabu2"
},
{
"q-placeholder": "q3-p3",
"q:localized": "q3-localized-p3",
"q-answer-placeholder": "jawaabu313",
"q-answer-localized": "localized jawaabu3"
}
]
}
Map Function :
function(doc, meta){
emit(doc.jsoninstance,[doc.pages[0].q-placeholder, doc.pages[0].q-localized,doc.pages[0].q-answer-placeholder,q-answer-localized]);
}
It looks like you made a typo at the end of your emit statement:
doc.pages[0].q-answer-placeholder,q-answer-localized.
Instead q-answer-localized should be changed to doc.pages[0].q-answer-localized.
Further to this, it seems you have referenced a field as q-localized in your emit statement, but according to the sample document you posted it should actually be q:localized. I assume this was a mistake in the snippet of the document and not in the emit statement, but if not then it will also need amending.
I would imagine errors like this would be flagged in the view engine's map-reduce error log; in future you should check this log so that you can debug errors like this yourself.
The location of the mapreduce_errors log can be found in the Couchbase documentation.
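In addition, key names containing - and : cannot be reached with dot notation in JavaScript at all, so a corrected map function would use bracket notation (a sketch, assuming the field names from the posted document):

function (doc, meta) {
  // Only index survey responses that actually contain pages
  if (doc.jsontype == "survey_response" && doc.pages && doc.pages.length > 0) {
    var page = doc.pages[0];
    // Bracket notation is required because the keys contain '-' and ':'
    emit(doc.jsoninstance, [page["q-placeholder"], page["q:localized"],
                            page["q-answer-placeholder"], page["q-answer-localized"]]);
  }
}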

How to tell when a document was created/updated in Couchbase?

I am using Spring Data to fetch Couchbase documents. I assume the document expiry cannot be set as a configurable value externalized in a property file (@Document(expiry = )), based on the answer in this link.
However, can I define a view in Couchbase and have a query return only the documents that have been created in the past 15 minutes? I don't want to save additional date information in my document, and I am looking to see if there is any metadata that can be used for this purpose.
function (doc, meta) {
if(doc._class=="com.customer.types.CustomerInfo" ){
emit(meta.id, doc);
}
}
If you go into the webconsole and create a development view, you can have a look at what metadata are available using the preview:
emit(meta.id, meta)
This gives us
{
"id": "someKey",
"rev": "3-00007098b90700000000000002000000",
"seq": "3",
"vb": "56",
"expiration": 0,
"flags": 33554432,
"type": "json"
}
So as you can see, there is no explicit metadata that you can use to tell if a doc has been created/updated in the last 15 minutes.
Side note: the id of the doc will always be part of the view's response so emitting it is a bit redundant. More importantly, don't emit the whole doc as a value (if in doubt on what to emit as the second field, just emit null). This value field is stored in the index, so that means that you are basically storing your document's content twice (once in the primary store, once in the view index)!
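Applied to the view in the question, that side note boils down to something like this (a sketch):

function (doc, meta) {
  if (doc._class == "com.customer.types.CustomerInfo") {
    // The document id is already part of every view row, and emitting null
    // as the value keeps the view index small
    emit(meta.id, null);
  }
}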