Growing number of Couchbase binary documents in XDCR destination bucket

I am running Couchbase Enterprise Edition version 6.6.2 on Windows Server 2016 Standard Edition.
I have two buckets called A and B. Bucket A is configured with enable_shared_bucket_access = true; my Sync Gateway creates new documents in bucket A, and a number of services modify and delete these documents.
XDCR replicates documents from bucket A to bucket B. All changes to documents in bucket A are replicated to bucket B, except deletions, which are not replicated. When documents in bucket B are older than 62 days, they are deleted by an external service.
Over time I noticed that 93% of the documents in bucket B are binary documents! My own documents are JSON; I don't use any kind of binary documents in my solution. This leads me to the conclusion that these binary documents are internal Couchbase documents.
Here is an example of these binary documents:
{
"$1": {
"cas": 1667520921496387584,
"expiration": 0,
"flags": 50331648,
"id": "_sync:rev:00001abd-1f99-4b4e-a695-d11574ea9ed8:0:",
"type": "base64"
},
"pa": "<binary (1 b)>"
},
{
"$1": {
"cas": 1667484959445614592,
"expiration": 0,
"flags": 50331648,
"id": "_sync:rev:00001abd-1f99-4b4e-a695-d11574ea9ed8:34:2-d3fb2d58672f853d98ce343d3ae84c1d",
"type": "base64"
},
"pa": "<binary (1129 b)>"
}
My issue with these documents is that their number grows dramatically over time and they never get cleaned up automatically, so they just keep consuming resources!
What are these documents used for?
Why aren’t these documents cleaned automatically?
Is it safe to simply delete these documents?
Is this a bug or a feature? :-)
Regards,
Siraf

The issue was solved by adding AND NOT REGEXP_CONTAINS(META().id, "^_sync:rev") to the XDCR replication filter expression. This stopped the binary documents from being replicated from bucket A to bucket B.
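For reference, this kind of filter can also be applied when the replication is created through the XDCR REST API. A minimal sketch, assuming the default admin port, placeholder credentials and a remote cluster reference named "remote" (only the bucket names and the filter expression come from this question):

# Sketch only: create an XDCR replication from bucket A to bucket B that skips
# Sync Gateway metadata documents (_sync:rev...). If the replication already has
# a filter expression, append the REGEXP_CONTAINS clause with AND as described above.
curl -u Administrator:password -X POST http://localhost:8091/controller/createReplication \
  -d fromBucket=A \
  -d toCluster=remote \
  -d toBucket=B \
  -d replicationType=continuous \
  -d 'filterExpression=NOT REGEXP_CONTAINS(META().id, "^_sync:rev")'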

Related

Couchbase Document with abnormal metadata

We're running Couchbase Server with Sync Gateway in our project. When I query Couchbase, I encounter documents with the metadata below. It has the flags and expiration fields equal to zero.
The problem is that this document was not in the DB yesterday. Is it normal to have a document with metadata like the one below? The document body has a lastupdate field that points to a date last year. It's possible that we deleted it before, but it has reappeared now.
When I checked the Couchbase and Sync Gateway logs, I didn't find any entries for its id.
{
"flags": 0,
"expiration": 0,
"id": "SomeIDXXX::YYY",
"cas": 1669165471522357248,
"type": "json"
}

MySQL InnoDB Cluster config - configure node address

I'm setting up an InnoDB Cluster using mysqlsh. This is in Kubernetes, but I think this question applies more generally.
When I use cluster.configureInstance() I see messages that include:
This instance reports its own address as node-2:3306
However, the nodes can only find each other through DNS at an address like node-2.cluster:3306. The problem comes when adding instances to the cluster; they try to find the other nodes without the qualified name. Errors are of the form:
[GCS] Error on opening a connection to peer node node-0:33061 when joining a group. My local port is: 33061.
It is using node-n:33061 rather than node-n.cluster:33061.
If it matters, the "DNS" is set up as a headless service in Kubernetes that provides consistent addresses as pods come and go. It's very simple, and I named it "cluster" to create addresses of the form node-n.cluster. I don't want to cloud this question with details I don't think matter, however, as surely other configurations require the instances in the cluster to use DNS as well.
I thought that setting localAddress when creating the cluster and adding the nodes would solve the problem. Indeed, after I added that to the createCluster options, I can look in the database and see
| group_replication_local_address | node-0.cluster:33061 |
After I create the cluster and look at the topology, it seems that the local address setting has no effect whatsoever:
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node-0:3306",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures.",
"topology": {
"node-0:3306": {
"address": "node-0:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.29"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node-0:3306"
}
And adding more instances continues to fail with the same communication errors.
How do I convince each instance that the address it needs to advertise is different? I will try other permutations of the localAddress setting, but it doesn't look like it's intended to fix the problem I'm having. How do I reconcile the address the instance reports for itself with the address that's actually useful for other instances to find it?
Edit to add: Maybe it is a Kubernetes thing? Or a Docker thing at any rate. There is an environment variable set in the container:
HOSTNAME=node-0
Does the containerized MySQL use that? If so, how do I override it?
Apparently this value has to be set at startup. Adding the option
--report-host=${HOSTNAME}.cluster
when starting the MySQL instances resolved the issue.
Specifically for Kubernetes, an example is at https://github.com/adamelliotfields/kubernetes/blob/master/mysql/mysql.yaml
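For illustration, a minimal sketch of that startup option; the .cluster suffix comes from the headless service described in the question, and everything else about the environment is assumed:

# Sketch only: make mysqld advertise the DNS name that the other members can resolve.
# HOSTNAME is set by the container runtime to the pod name (e.g. node-0), so this
# reports node-0.cluster instead of node-0.
mysqld --report-host="${HOSTNAME}.cluster"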

Model Derivative's manifest is missing URN for SVF2

I know that it is possible to download the derivatives via their respective URNs. However, the SVF2 object in the manifest doesn't contain its URN. Therefore, I cannot download the derivative as explained here or here. Is this not supported yet? And can I compute the URN from the data returned in the manifest?
Extract of a manifest example:
{
"urn": "SOME_URN",
"derivatives": [
{
"hasThumbnail": "true",
"children": [
{
"useAsDefault": true,
"role": "3d",
"hasThumbnail": "true",
"children": [
{
...
},
{
...
},
{
"role": "graphics",
"mime": "application/autodesk-svf2",
"guid": "SOME_GUID",
"type": "resource"
}
],
I'd like to make clear that it is possible to download the SVF2 'files', since your web browser can do it; therefore, you can access the data as well. The files are actually cached in your browser, see below.
The Viewer downloads an extra manifest file (otg_model.json) which contains additional information. But downloading the 'files' to your local machine will not help, since it requires a lot of setup to get the Viewer to work properly with local SVF2 storage. And with the current state of the technology, it is highly recommended that you do not try to do this in production. When it comes to development and debugging, I've got a sample posted here which can help. But please be careful with the Autodesk EULA when doing offline workflows. This sample is a replacement of the old extract.autodesk.io sample, as people were abusing that website, and it works with both SVF and SVF2.
To answer the question in the comment section: SVF2 is still in beta, and access to the underlying data/files will probably only be available at the end of the beta. The main reason is that SVF2 and the Viewer code evolve too rapidly today to make everything generally available. So unless you keep updating them on your local machine, things may break, and therefore Autodesk is limiting access.
Sorry for disappointing you, but ...
Unfortunately, it's expected behavior. SVF2 doesn't have a concept of URN, and you cannot download SVF2 for offline viewing at this moment since it's unsupported.

Couchbase deleted documents reappearing in database

We are experiencing a problem where deleted documents are reappearing on our Couchbase server.
We have a scenario where documents are created on CBL. These documents are synced up to the server. The user realizes an error has been made and flags the document as incorrect. On the server, the admin can then view all of the flagged documents and delete them from the server. The Sync Gateway has been set up to only sync up these types of documents, i.e. once an edit has been made to these documents on the server, the changes are not synced back down to CBL.
Here is the process of what is happening:
1. Document is created on CBL with a TTL of 15 days and synced to Sync Gateway.
2. Document is updated on CBL and synced to Sync Gateway.
3. Document is deleted from the Couchbase Server bucket with a DELETE N1QL query.
4. After the document is deleted from the bucket it gets randomly added again within a few days.
5. Only documents that are still on the devices, i.e. not older than the TTL of 15 days, are added back to the bucket.
We tried increasing the Metadata Purge Interval to more than 15 days but this did not resolve the problem.
Does anybody have any suggestions or possibly know what could be the problem here?
Couchbase Server Community Edition 6.5.1 build 6299
Sync gateway 2.7.3
Couchbase Lite Android 2.8.1
Thanks in advance!
PS: Here is our Sync Gateway config with the sync function:
"log": [
"*"
],
"adminInterface": "0.0.0.0:4985",
"interface": "0.0.0.0:4984",
"databases": {
"prod": {
"server": "http://localhost:8091",
"bucket": "prod_bukcet",
"username": "sync_gateway",
"password": "XXX",
"enable_shared_bucket_access": true,
"import_docs": "continuous",
"use_views": true,
"users": {
"user_X": {
"password": "XXX",
"admin_channels": ["*"],
"disabled": false
}
},
"sync":`
function sync(doc, oldDoc) {
/* sanity check */
// check if document was removed from server or via SDK
// In this case, just return
if (isRemoved()) {
return;
}
//Only sync down documents that are created on the server
if (doc.deviceDoc == true) {
channel("server");
} else {
if (doc.siteId) {
channel(doc.siteId);
} else {
channel("devices");
}
}
// This is when document is removed via SDK or directly on server
function isRemoved() {
return (isDelete() && oldDoc == null);
}
function isDelete() {
return (doc._deleted == true);
}
}`,
}
}
}
In shared bucket access mode (enable_shared_bucket_access: true), an N1QL delete on a document creates a tombstone. Tombstones are always synced. The metadata purge interval setting on the server determines the period after which the tombstone gets purged on the server. So it is typical to set it to a value that matches the maximum partition window of the clients, that is, to ensure that all disconnected clients have the opportunity to get the deleted document. Setting it to more than 15 days therefore just means that the tombstone is only purged after that interval, so tombstoned documents will be synced down to clients in the meantime.
In your case, if you don't want documents to be synced down to clients because the lifetime of the document is managed independently on the CBL side via the expirationDate(), then purge the document instead of deleting it on the server.
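Since the documents are imported through Sync Gateway (shared bucket access), one way to purge them is via the Sync Gateway admin REST API. A minimal sketch, assuming the database name "prod" and admin port 4985 from the config above; the document id is a placeholder:

# Sketch only: purge a document and all of its revisions through the Sync Gateway
# admin API, so no tombstone is left behind to be synced down to clients.
curl -X POST http://localhost:4985/prod/_purge \
  -H "Content-Type: application/json" \
  -d '{"some-doc-id": ["*"]}'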

HTTP REST: update large collections: better approach than JSON PATCH?

I am designing a web service to regularly receive updates to lists. At this point, a list can still be modeled as a single entity (/lists/myList) or an actual collection with many resources (/lists/myList/entries/<ID>). The lists are large (millions of entries) and the updates are small (often less than 10 changes).
The client will get web service URLs and lists to distribute, e.g.:
http://hostA/service/lists: list1, list2
http://hostB/service/lists: list2, list3
http://hostC/service/lists: list1, list3
It will then push lists and updates as configured. It is likely but undetermined if there is some database behind the web service URLs.
I have been researching and it seems an HTTP PATCH using the JSON Patch format is the best approach.
Context and examples:
Each list has an identifying name, a priority and millions of entries. Each entry has an ID (determined by the client) and several optional attributes. Example to create a list "requiredItems" with priority 1 and two list entries:
PUT /lists/requiredItems
Content-Type: application/json
{
"priority": 1,
"entries": {
"1": {
"color": "red",
"validUntil": "2016-06-29T08:45:00Z"
},
"2": {
"country": "US"
}
}
}
For updates, the client would first need to know what the list looks like now on the server. For this I would add a property "revision" to the list entity.
Then, I would query this attribute:
GET /lists/requiredItems?property=revision
Then the client would see what needs to change between the revision on the server and the latest revision known by the client and compose a JSON patch. Example:
PATCH /lists/requiredItems
Content-Type: application/json-patch+json
[
{ "op": "test", "path": "/revision", "value": 3 },
{ "op": "add", "path": "/entries/3", "value": { "color": "blue" } },
{ "op": "remove", "path": "/entries/1" },
{ "op": "remove", "path": "/entries/2/country" },
{ "op": "add", "path": "/entries/2/color", "value": "green" },
{ "op": "replace", "path": "/revision", "value": 10 }
]
Questions:
This approach has the drawback of slightly less client support due to the not-often-used HTTP verb PATCH. Is there a more compatible approach without sacrificing HTTP compatibility (idempotency et cetera)?
Modelling the individual list entries as separate resources and using PUT and DELETE (perhaps with ETag and/or If-Match) seems an option (PUT /lists/requiredItems/entries/3, DELETE /lists/requiredItems/entries/1, PUT /lists/requiredItems/revision), but how would I make sure all those operations are applied when the network drops in the middle of an update chain? Is an HTTP PATCH allowed to work on multiple resources?
Is there a better way to 'version' the lists, perhaps implicitly also improving how they are updated? Note that the client determines the revision number.
Is it correct to query the revision number with GET /lists/requiredItems?property=revision? Should it be a separate resource like /lists/requiredItems/revision? If it should be a separate resource, how would I update it atomically (i.e. the list and revision are both updated or both not updated)?
Would it work in JSON patch to first test the revision value to be 3 and then update it to 10 in the same patch?
This approach has the drawback of slightly less client support due to the not-often-used HTTP verb PATCH.
As far as I can tell, PATCH is really only appropriate if your server is acting like a dumb document store, where the action is literally "please update your copy of the document according to the following description".
So if your resource really just is a JSON document that describes a list with millions of entries, then JSON-Patch is a great answer.
But if you are expecting that the patch will, as a side effect, update an entity in your domain, then I'm suspicious.
Is an HTTP PATCH allowed to work on multiple resources?
RFC 5789
The PATCH method affects the resource identified by the Request-URI, and it also MAY have side effects on other resources
I'm not keen on querying the revision number; it doesn't seem to have any clear advantage over using an ETag/If-Match approach. Some obvious disadvantages: the caches between you and the client don't know that the list and the version number are related; a cache will happily tell a client that version 12 of the list is version 7, or vice versa.
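To make the ETag/If-Match alternative concrete, a minimal sketch; the entity tag, host and patch body are illustrative, not taken from the question:

# Sketch only: read the list and note the ETag the server returns.
curl -i http://hostA/service/lists/requiredItems
# ...
# ETag: "abc123"

# Sketch only: apply the patch only if the representation is still the one we read;
# a stale ETag makes the server answer 412 Precondition Failed.
curl -X PATCH http://hostA/service/lists/requiredItems \
  -H 'If-Match: "abc123"' \
  -H "Content-Type: application/json-patch+json" \
  -d '[{ "op": "remove", "path": "/entries/1" }]'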
Answering my own question. My first bullet point may be opinion-based and, as has been pointed out, I've asked many questions in one post. Nevertheless, here's a summary of what was answered by others (VoiceOfUnreason) and my own additional research:
ETags are HTTP's resource 'hashes'. They can be combined with If-Match headers to have a versioning system. However, ETag-headers are normally not used to declare the ETag of a resource that is being created (PUT) or updated (POST/PATCH). The server storing the resource usually determines the ETag. I've not found anything explicitly forbidding this, but many implementations may assume that the server determines the ETag and get confused when it is provided with PUT or PATCH.
A separate revision resource is a valid alternative to ETags for versioning. This resource must be updated at the same time as the resource it is the revision of.
It is not semantically enforceable at the HTTP level to have commit/rollback transactions, unless by modelling the transaction itself as a REST resource, which would make things much more complicated.
However, some properties of PATCH allow it to be used for this:
An HTTP PATCH must be atomic and can operate on multiple resources. RFC 5789:
The server MUST apply the entire set of changes atomically and never provide (e.g., in response to a GET during this operation) a partially modified representation. If the entire patch document cannot be successfully applied, then the server MUST NOT apply any of the changes.
The PATCH method affects the resource identified by the Request-URI, and it also MAY have side effects on other resources; i.e., new resources may be created, or existing ones modified, by the application of a PATCH. PATCH is neither safe nor idempotent
JSON Patch can consist of multiple operations on multiple resources, and either all of them must be applied or none, making it an implicit transaction. RFC 6902: operations are applied sequentially in the order they appear in the array.
Thus, the revision can be modeled as a separate resource and still be updated at the same time. Querying the current revision is a simple GET. Committing a transaction is a single PATCH request containing first a test of the revision, then the operations on the resource(s) and finally the operation to update the revision resource.
The server can still choose to publish the revision as ETag of the main resource.