Identifying Duplicates in CouchDB

I'm new to CouchDB and document-oriented databases in general.
I've been playing around with CouchDB, and was able to get familiar with creating documents (with perl) and using the Map/Reduce functions in Futon to query the data and create views.
One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.
For example, if I have the following documents:
{
"_id": "123",
"name": "carl",
"timestamp": "2012-01-27T17:06:03Z"
}
{
"_id": "124",
"name": "carl",
"timestamp": "2012-01-27T17:07:03Z"
}
And I wanted to get a list of document IDs that had duplicate "name" values, is this something I could do with the Futon Map/Reduce?
The result I was hoping to achieve is as follows:
{
"name": "carl",
"dupes": [ "123", "124" ]
}
..or..
{
"carl": [ "123", "124" ]
}
.. which would be the value, and associated document ids which contain those duplicate values.
I've tried a few different things with Map/Reduce, but as far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.
I know I could just pull the data I need with Perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand its benefits / limitations.
Another way I'm thinking about doing this is to use a single document like an RDBMS table:
{
"_id": "names",
"rec1": {
"_id": "123",
"name": "carl",
"timestamp": "2012-01-27T17:06:03Z"
},
"rec2": {
"_id": "124",
"name": "carl",
"timestamp": "2012-01-27T17:07:03Z"
}
}
.. which should allow me to use the Map/Reduce functions in the way I originally thought. However I'm not sure if this is ideal.
I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.
Thanks!
Edit: Fixed JSON syntax in some of the examples.

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.
In both cases, a map function like this should suffice:
function (doc) {
emit(doc.name);
}
For your reduce function, just enter _count.
When queried with group=true, your view output will look like this (based on your 2 documents):
{
"rows": [
{ "key": "carl", "value": 2 }
]
}
From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.
function (head, req) {
var row, duplicates = [];
while (row = getRow()) {
if (row.value > 1) {
duplicates.push(row);
}
}
send(JSON.stringify(duplicates));
}
Read up on _list functions; they're pretty handy and versatile.
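The counted view gives names and frequencies but not the document IDs the question asks for, since _count discards them (rows fetched with reduce=false still carry each document's id). As a minimal sketch of the same group-then-filter idea, here is a client-side version in Python that works on plain dictionaries rather than a live CouchDB. The function name find_duplicate_names and the third document ("dana") are made up for illustration.

```python
from collections import defaultdict

def find_duplicate_names(docs):
    """Group document ids by name and keep names that occur more than once."""
    ids_by_name = defaultdict(list)
    for doc in docs:
        # Equivalent of the map step: one entry per document, keyed by name.
        if "name" in doc:
            ids_by_name[doc["name"]].append(doc["_id"])
    # Same test as `row.value > 1` in the _list function above.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

docs = [
    {"_id": "123", "name": "carl", "timestamp": "2012-01-27T17:06:03Z"},
    {"_id": "124", "name": "carl", "timestamp": "2012-01-27T17:07:03Z"},
    # Hypothetical extra document, added only to show a non-duplicate.
    {"_id": "125", "name": "dana", "timestamp": "2012-01-27T17:08:03Z"},
]
print(find_duplicate_names(docs))  # {'carl': ['123', '124']}
```

This produces the second output shape from the question, with each duplicated name mapped to its document IDs.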

Related

Selector query comparing two fields Cloudant/CouchDB/Mango

I have a CouchDB database, which uses the Mango query language (apparently the same as Cloudant's query language).
I'm trying to search and compare two fields to each other and only return the relevant results when they're equal.
For example:
{
"_id": "ACCEPT0",
"_rev": "1-92ea4e727271aefd0a2befed0d4bb736",
"OfferID": "OFFER0"
}
{
"_id": "ACCEPT1",
"_rev": "3-986ca6e717b225ac909d644de54d5f7d",
"OfferID": "OFFER3"
}
{
"_id": "OFFER0",
"_rev": "1-2af5f5c7b1c59dd3f0997f748a367cb2",
"From": "merchant1",
"To": "customer1"
}
{
"_id": "OFFER1",
"_rev": "6-f0927c5d4f9fd8a2d2b602f1c265d6d5",
"From": "merchant1",
"To": "customer2"
}
I'm trying to come up with a query which will, in this example, return "OFFER0", since OFFER0 appears in an "OfferID" field.
EDIT (clarification): The query needs to be able to select all the _id values which begin with OFFER and which exist in an OfferID field.
I know I can set this up with a view (as seen in: Cloudant query to return records where 2 fields are equal), but I need this in a Mango query as it'll be running over Hyperledger.

You can easily return documents whose _id field starts with "OFFER" with the following query:
{
"selector": {
"_id": {
"$regex": "^OFFER"
}
}
}
but this is likely to be inefficient because Cloudant has to scan the whole database, testing each document's _id field against that regular expression.
A better way to design your data may be to have a type field which distinguishes between the document types in your database e.g.
{
"_id": "OFFER0",
"_rev": "1-2af5f5c7b1c59dd3f0997f748a367cb2",
"type": "offer",
"From": "merchant1",
"To": "customer1"
}
and then a query to return all documents where type = 'offer' becomes:
{
"selector": {
"type": "offer"
}
}
I don't fully understand the part of the question where you say "which exist in an OfferID field." but it's important to note that Cloudant Query & Mango can only query single documents - you can't say "get me all the documents which are offers, where another document has a certain property". Include all the data you need in each document and then you'll be able to query it cleanly.
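Because a single selector cannot look across documents, one possible workaround is to do the join on the client in two steps: first fetch the documents that have an OfferID, then ask for the documents whose _id is in that set. The sketch below only builds the second selector in Python. The function name referenced_offer_selector is hypothetical, and whether a two-step approach is practical over Hyperledger is an assumption, not something the question confirms.

```python
def referenced_offer_selector(accept_docs):
    """Build a Mango selector for documents whose _id appears in some OfferID field."""
    offer_ids = sorted({d["OfferID"] for d in accept_docs if "OfferID" in d})
    return {"selector": {"_id": {"$in": offer_ids}}}

# Step 1 (assumed already done): results of a query such as
# {"selector": {"OfferID": {"$exists": True}}}
accepts = [
    {"_id": "ACCEPT0", "OfferID": "OFFER0"},
    {"_id": "ACCEPT1", "OfferID": "OFFER3"},
]

# Step 2: this selector would be sent as the body of a second query.
print(referenced_offer_selector(accepts))
# {'selector': {'_id': {'$in': ['OFFER0', 'OFFER3']}}}
```

With the sample data from the question, the second query would match only OFFER0, because no document has the _id OFFER3.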

Elasticsearch mapping of nested structure

I'm looking for some pointers on mapping a somewhat dynamic structure for consumption by Elasticsearch.
The raw structure itself is json, but the problem is that a portion of the structure contains a variable, rather than the outer elements of the structure being static.
To provide a somewhat redacted example, my JSON looks like this:
"stat": {
"state": "valid",
"duration": 5
},
"12345-abc": {
"content_length": 5,
"version": 2
},
"54321-xyz": {
"content_length": 2,
"version": 1
}
The first block is easy; Elasticsearch does a great job of mapping the "stat" portion of the structure, and if I were to dump a lot of that data into an index it would work as expected. The problem is that the next 2 blocks are essentially the same thing, but the raw json is formatted in such a way that a unique element has crept into the structure, and Elasticsearch wants to map that by default, generating a map that looks like this:
"stat": {
"properties": {
"state": {
"type": "string"
},
"duration": {
"type": "double"
}
}
},
"12345-abc": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
},
"54321-xyz": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
}
I'd like the ability to index all of the "content_length" data, but it's getting separated, and with some of the variable names being used, when I drop the data into Kibana I wind up with really long fieldnames that become next to useless.
Is it possible to provide a generic tag to the structure? Or is this more trivially addressed at the JSON generation phase, with our developers hard-coding a generic structure name and adding an identifier field?
Any insight / help greatly appreciated.
Thanks!

If keys like 12345-abc are generated and can take an unbounded number of values, it will be hard (if not impossible) to run useful queries or aggregations. It's not really clear exactly how you want to analyze your data, but you should probably have a look at nested objects (https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html) and generate your input JSON according to what you want to query for. You will likely get better aggregation results if you put these additional objects into an array, each with a dedicated field containing what is currently your key:
{
"stat": ...,
"things": [
{
"thingkey": "12345-abc",
"content_length": 5,
"version": 2
},
...
]
}
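If the JSON cannot be changed at the source, the same reshaping could be done just before indexing. A minimal Python sketch follows, assuming that "stat" is the only fixed top-level key. The function name restructure is made up, and the names "things" and "thingkey" are simply taken from the example above.

```python
def restructure(raw, fixed_keys=("stat",)):
    """Move dynamically named objects into a 'things' list, keeping the key as a field."""
    result = {k: v for k, v in raw.items() if k in fixed_keys}
    result["things"] = [
        {"thingkey": key, **value}
        for key, value in raw.items()
        if key not in fixed_keys
    ]
    return result

raw = {
    "stat": {"state": "valid", "duration": 5},
    "12345-abc": {"content_length": 5, "version": 2},
    "54321-xyz": {"content_length": 2, "version": 1},
}
print(restructure(raw))
```

Every content_length value then lives under the same field name, regardless of which key it originally came from.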

Update MongoDB document using JSON

Is there a way to update a complex MongoDB document from C# using JSON? For example, suppose I have the following document:
{
"name": "John Smith",
"age": 35,
"readingList":
[{
"title": "Title1",
"ISBN": 6246246426724,
"author":
{
"name": "James Johnson",
"age": 40
}
},
{
"title": "Title2",
"ISBN": 3513531513551,
"author":
{
"name": "Sam Hill",
"age": 20
}
}]
}
Now I want to update the age of the second book's author (Sam Hill) from 20 to 21. Suppose I have the following JSON representation:
{
"readingList":
[
{
"title": "Title2",
"author":
{
"age": 21
}
}]
}
Basically the second JSON string is like the first one, minus all the fields and array elements that don't change, except for one field in any array being looked at that uniquely identifies that index. In this case, the "age" field is included since it is being updated with the given value. The "title" field is given to locate the right array element while searching for the field to update. There may also be even more subdocuments and arrays to go through, and the format is not static (it may change at a later time). This is just a simplified example.
Is it possible to pass in something like this to some function and update the correct field that way? Is there something at least similar to this, so I can just pass in some JSON to do the update?
The reason I am looking to do it this way, rather than through simpler means, is because I want to keep track of a history of changes to documents, and if I want to backtrack to an earlier version, I want an easy way to do so that can handle this level of complexity.
UPDATE:
I have some clarifications to make. In this particular scenario I have no way to predict what kinds of changes would need to be made. A change could be made to any field at any time, and that field could be anywhere in the document, possibly at the top level, or within multiple nested subdocuments/arrays. The data we're dealing with is for a separate party that may use it and modify it at will, so we have no control over what they choose to do with it. In addition, there is no fixed schema. The other party could add new fields, including new subdocuments or arrays, or delete them.
The reason I'm asking this question is because I would like to store a history of changes to documents in such a way that I could revert to an older snapshot of the document by applying the changes in reverse. In this case, changing the age from 20 to 21 would revert the document to an older state (assuming that someone messed with the age beforehand and made it 20, and I wanted to fix it back to 21). Since somebody could make any change they wanted to the system, including to the underlying structure of the data itself, I can't just come up with my own schema, or hardcode a solution that changes specific fields using this specific schema.
In this example, the change in age from 20 to 21 would be from a record in the history whose structure I couldn't predict beforehand. So I am looking for an efficient solution to apply an unpredictable update to a document given a simplified JSON representation of the change to be made.
I am also open to alternatives that don't involve JSON if they are fairly efficient. I brought up JSON because I figured that, given MongoDB's usage of JSON to structure documents, it would make the most sense, and perhaps be superior to something like string manipulation. Another alternative I considered would involve storing the change using some kind of custom dot notation, like this: readingList[ISBN:3513531513551].author.age=21
This would require me to create a custom function to interpret the string and turn it into something useful though, so it doesn't sound like the best solution.

I used the following JSON document:
{
"_id" : ObjectId("56a99c121f25cc3a3c709151"),
"name" : "John Smith",
"age" : 35,
"readingList" : [
{
"title" : "Title1",
"ISBN" : NumberLong(6246246426724),
"author" : {
"name" : "James Johnson",
"age" : 40
}
},
{
"title" : "Title2",
"ISBN" : NumberLong(3513531513551),
"author" : {
"name" : "Sam Hill",
"age" : "25"
}
}
]
}
I used the author's name (Sam Hill) as the condition and executed the query below in C#, and it works.
// Match the person and the array element whose author is Sam Hill.
IMongoQuery query = Query.And(Query.EQ("name", "John Smith"), Query.EQ("readingList.author.name", "Sam Hill"));
// "$" refers to the first array element matched by the query.
var result = collection.Update(query,
MongoDB.Driver.Builders.Update.Set("readingList.$.author.age", "21"));

You can query your main document. Let's assume your main collection is named "books" and has this structure:
{
"id":"123",
"name": "John Smith",
"age": 35,
"readingList":
[{
"title": "Title1",
"ISBN": 6246246426724,
"author":
{
"name": "James Johnson",
"age": 40
}
},
{
"title": "Title2",
"ISBN": 3513531513551,
"author":
{
"name": "Sam Hill",
"age": 20
}
}]
}
You need a query that returns the main document, for example by id. Once you have the main document, find the element you want to modify in the list and assign it to a variable, say readItem. Then make the modifications you need and update only that element of the array using Set with the positional "$" operator. Note that the query must also match the array element (for example on readingList.title) for "$" to resolve. Something like:
readItem.title = "some new title";
readItem.author.age++;
var update = MongoDB.Driver.Builders.Update.Set("readingList.$", BsonDocumentWrapper.Create(readItem));
Update<Book>(query, update);

Actually I would not advise you to choose this kind of data model because in my experience it will get pretty messy. Still, you might have some very specific requirements which might force you to have this and only this data model.
I would create two collections: persons and readinglists.
persons would look like:
{
"id":"123",
"name": "John Smith",
"age": 35
}
and readinglists would look like (note that it has a compound natural id):
{
"_id": { "personid":"123", "title": "Title1"},
"ISBN": 6246246426724,
"author":
{
"name": "James Johnson",
"age": 40
}
}
Then you can easily update the readinglist:
var query = Query.EQ("_id", new BsonDocument(new BsonElement[]{ new BsonElement("personid", "123"), new BsonElement("title", "Title1")}));
readingListCollection.Update(query, Update.Set("author.age", 22));

In your data model you need to know the array index of the second document. It is better to model the readingList attribute as a map. In the following example I used the ISBN as the map key:
{
"id":"123",
"name":"John Smith",
"age":35,
"readingList":{
"6246246426724":{
"title":"Title1",
"ISBN":6246246426724,
"author":{
"name":"James Johnson",
"age":40
}
},
"3513531513551":{
"title":"Title2",
"ISBN":3513531513551,
"author":{
"name":"Sam Hill",
"age":20
}
}
}
}
In this data model you can access the second book directly, for instance by dot notation:
db.authors.update(
{ id: "123" },
{ $set: { "readingList.3513531513551.author.age": 22 } }
)
Unfortunately I don't know the C# notation for that, but it should be straightforward.
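With a map-based model like this, a nested JSON change record of the kind described in the question could be turned into dot-notation paths mechanically. The following is a minimal Python sketch, assuming the change record contains only nested objects (arrays are treated as whole values). The function name to_dot_paths is hypothetical.

```python
def to_dot_paths(patch, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} pairs usable under $set."""
    paths = {}
    for key, value in patch.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty sub-documents.
            paths.update(to_dot_paths(value, path))
        else:
            # Scalars, lists and empty dicts are set as whole values.
            paths[path] = value
    return paths

patch = {"readingList": {"3513531513551": {"author": {"age": 21}}}}
update = {"$set": to_dot_paths(patch)}
print(update)  # {'$set': {'readingList.3513531513551.author.age': 21}}
```

The resulting document has the same shape as the $set argument in the shell command above, so the stored change can stay plain JSON without a custom path syntax.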

Freebase MQL to list out all commons types for a given word?

I'm trying to figure out how to write an MQL query to get a list of all the types associated with a given word.
For example I tried:
{
"id":null,
"name":null,
"name~=": "SOME_WORD",
"type":"/type/type",
"domain": {
"id": null,
"/freebase/domain_profile/category": {
"id": "/category/commons"
}
}
}
I found this to list out all the Commons types or categories but haven't yet figured out how to narrow it down for a given input.
[{
"id": null,
"name": null,
"type": "/freebase/domain_profile",
"category": {
"id": "/category/commons"
}
}]

There are a couple of different ways to do this, each with different tradeoffs.
Use the Search API with a query like this:
https://www.googleapis.com/freebase/v1/search?indent=true&filter=%28all%20name{full}:%22uss%20constitution%22%29
You'll get back JSON results which look like this:
{
"status": "200 OK",
"result": [
{
"mid": "/m/07y14",
"name": "USS Constitution",
"notable": {
"name": "Ship",
"id": "/boats/ship"
},
"lang": "en",
"score": 1401.410400
},
...
You can make the matching more liberal by switching the "{full}" to "{phrase}" which will give you a substring match instead of an exact match.
Caveats:
- You'll only get a single "notable type" and it's fixed by Freebase's (unknown) algorithm
- I don't think there's a way to get both USS Constitution & U.S.S. Constitution results
- You can get a list of all types by adding &mql_output={"type":[]}, but then you lose the "notable" type. I don't think there's a way to get both in a single call.
Use MQL
This query returns the basic information that you want:
[{
"name~=":"uss constitution",
"type":[],
"/common/topic/notable_types" : []
}]
Caveats:
- It won't find "uss constitution" which is an alias rather than the primary name (there's a recipe in the MQL cookbook for that though)
- It won't find "u.s.s. constitution"
- The "notable_types" API is an MQL extension and MQL extensions aren't supported in the new Freebase API, only the legacy deprecated API
- You're tied to whatever (unknown) algorithm Freebase used to compute "notability"
Depending on what you are trying to accomplish, you might need something more sophisticated than this (as well as a deeper understanding of what's in Freebase), but this should get you going with the basics.

Did you try:
[{
"name": "David Bowie",
"type": []
}]

How should a JSON response be formatted?

I have a REST service that returns a list of objects. Each object contains objectcode and objectname.
This is my first time building a REST service, so I'm not sure how to format the response.
Should it be:
{
"objects": {
"count": 2,
"object": [
{
"objectcode": "1",
"objectname": "foo"
},
{
"objectcode": "2",
"objectname": "bar"
},
...more objects
]
}
}
OR
[
{
"objectcode": "1",
"objectname": "foo"
},
{
"objectcode": "2",
"objectname": "bar"
},
...more objects
]
I realize this might be a little subjective, but which would be easier to consume? I would also need to support XML formatted response later.

They are the same to consume, as a library handles both just fine. The first one has an advantage over the second though: You will be able to expand the response to include other information additional to the objects (for example, categories) without breaking existing code.
Something like
{
"objects": {
"count": 2,
"object": [
{
"objectcode": "1",
"objectname": "foo"
},
{
"objectcode": "2",
"objectname": "bar"
},
...more objects
]
},
"categories": {
"count": 2,
"category" : [
{ "name": "some category"}
]
}
}
Additionally, the JSON sent over the wire doesn't need to be pretty-printed, so you can strip whitespace, line breaks etc. Also, the count isn't really necessary, as the client can derive it from the array once the objects are parsed.
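To make the last two points concrete, here is a minimal Python sketch that serializes an envelope compactly and recovers the count on the client side. The envelope is a simplified assumption (the list sits directly under "objects", with no "count" field), not the exact structure shown above.

```python
import json

objects = [
    {"objectcode": "1", "objectname": "foo"},
    {"objectcode": "2", "objectname": "bar"},
]

# Envelope form: sibling keys can be added later without breaking clients.
envelope = {"objects": objects}

# Compact serialization: no spaces after separators, no line breaks.
body = json.dumps(envelope, separators=(",", ":"))
print(body)

# The client derives the count from the parsed list.
parsed = json.loads(body)
print(len(parsed["objects"]))  # 2
```

A client that only reads parsed["objects"] keeps working if a "categories" key is added to the envelope later.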

I often see the first one. Sometimes metadata makes the data easier to manipulate. For example, the Google API uses the first one: http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true

It's not only a question of personal preference; it's also a question of your requirements. For example, if I were in the same situation and needed the object count on the client side, I'd go with the first approach; otherwise I would choose the second one.
Also please note that a "classic" REST server will mostly work a bit differently. If a REST function is to return a list of objects, then it should return only a list of URLs to those objects. The URLs should point to detail endpoints, so by querying each endpoint you can get the details of a specific single object.

As a client I would prefer the second format. If the first format only includes the number of "objects", this is redundant information.
