reactivemongo - merging two BSONDocuments - json

I am looking for the most efficient and easiest way to merge two BSON documents. I already have handlers for collisions: for example, if both documents contain an integer I sum the values, if both contain a string I concatenate them, if both contain an array I append the elements of the other one, and so on.
However, due to BSONDocument's immutable nature, it is hard to manipulate in place. What would be the easiest and fastest way to do the merging?
For example, I need to merge the following two documents:
{
"2013": {
"09": {
value: 23
}
}
}
{
"2013": {
"09": {
value: 13
},
"08": {
value: 1
}
}
}
And the final document would be:
{
"2013": {
"09": {
value: 36
},
"08": {
value: 1
}
}
}
There is a BSONDocument.add method, but it doesn't check key uniqueness, which means I would end up with two entries under the "2013" root key, and so on.
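The direction I have in mind is a recursive rebuild, something like the rough sketch below (assuming a ReactiveMongo 0.12-style BSON API, where elements yields BSONElement(name, value) pairs and BSONDocument supports the -- and ++ operators; please correct me if the signatures differ):
import reactivemongo.bson._

// Recursively merge `right` into `left`, applying the collision handlers:
// sum integers, concatenate strings, append arrays, recurse into documents.
def merge(left: BSONDocument, right: BSONDocument): BSONDocument =
  right.elements.foldLeft(left) { (acc, elem) =>
    val combined: BSONValue = (acc.get(elem.name), elem.value) match {
      case (Some(BSONInteger(a)), BSONInteger(b))   => BSONInteger(a + b)
      case (Some(BSONString(a)), BSONString(b))     => BSONString(a + b)
      case (Some(a: BSONArray), b: BSONArray)       => BSONArray(a.values ++ b.values)
      case (Some(a: BSONDocument), b: BSONDocument) => merge(a, b)
      case _                                        => elem.value // no collision: keep the right-hand value
    }
    // Drop any previous binding for the key, then append the combined value,
    // rebuilding a fresh immutable document on every step.
    (acc -- elem.name) ++ BSONDocument(elem.name -> combined)
  }
Is there a cheaper way than rebuilding the document on every key?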
Thank you!

If I understand your inquiry, you are looking to aggregate field data via a composite id. MongoDB has a fairly slick aggregation framework. Part of that framework is the $group pipeline stage, which allows you to specify an _id to group by, defined either as a single field or as a document (as in your example), as well as to perform aggregation using accumulators such as $sum.
Here is a link to the manual page for the operators you will probably need:
http://docs.mongodb.org/manual/reference/operator/aggregation/group/
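For illustration only (the collection and field names below are made up): $group operates on flat fields, so a nested layout like yours would first need to be reshaped, e.g. with $project, into something like { year: "2013", month: "09", value: 23 }. The grouping stage itself would then look roughly like this in the mongo shell:
db.values.aggregate([
  { $group: {
      _id: { year: "$year", month: "$month" },  // composite id to group by
      total: { $sum: "$value" }                 // sum the colliding values
  } }
])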
Also, please remove the "merge" tag from your original inquiry to reduce confusion. Many MongoDB drivers include a Merge function on their BsonDocument representation as a way to consolidate two BsonDocuments into a single BsonDocument, linearly or via element overwrites, and it has no relation to aggregation.
Hope this helps.
ndh

Related

How can I query for multiple values after a wildcard?

I have a json object like so:
{
_id: "12345",
identifier: [
{
value: "1",
system: "system1",
text: "text!"
},
{
value: "2",
system: "system1"
}
]
}
How can I use the X DevAPI SearchConditionStr to look for the specific combination of value + system in the identifier array? Something like this, but it doesn't seem to work:
collection.find("'${identifier.value}' IN identifier[*].value && '${identifier.system} IN identifier[*].system")
By using the IN operator, what happens underneath the covers is basically a call to JSON_CONTAINS().
So, if you call:
collection.find(":v IN identifier[*].value && :s IN identifier[*].system")
.bind('v', '1')
.bind('s', 'system1')
.execute()
What gets executed, in the end, is (simplified):
JSON_CONTAINS('["1", "2"]', '"1"') AND JSON_CONTAINS('["system1", "system1"]', '"system1"')
In this case, both those conditions are true, and the document will be returned.
The atomic unit is the document (not a slice of it). So, in your case, regardless of the values of value and/or system, you are still looking for the same document (the one whose _id is '12345'). With such a statement, the document is returned if all the search values are part of it, and not returned if any one of them is not.
For instance, the following would not yield any results:
collection.find(":v IN identifier[*].value && :s IN identifier[*].system")
.bind('v', '1')
.bind('s', 'system2')
.execute()
EDIT: Potential workaround
I don't think using the CRUD API will allow you to perform this kind of "cherry-picking", but you can always use SQL. In that case, one strategy that comes to mind is to use JSON_SEARCH() to retrieve the array of paths (i.e. the array indexes) corresponding to each value in the scope of identifier[*].value and identifier[*].system, and then use JSON_OVERLAPS() to check that the two sets of paths intersect, meaning the matches land on the same array element.
session.sql(`select * from collection WHERE json_overlaps(json_search(json_extract(doc, '$.identifier[*].value'), 'all', ?), json_search(json_extract(doc, '$.identifier[*].system'), 'all', ?))`)
.bind('2', 'system1')
.execute()
In this case, the result set will only include documents where the identifier array contains at least one JSON object element whose value is '2' and whose system is 'system1'. The filter is effectively applied to individual array items, not in aggregate as with a basic IN operation.
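To trace that against the sample document: json_extract(doc, '$.identifier[*].value') yields ["1", "2"], so json_search(..., 'all', '2') returns "$[1]"; json_extract(doc, '$.identifier[*].system') yields ["system1", "system1"], so json_search(..., 'all', 'system1') returns ["$[0]", "$[1]"]. The two path sets share "$[1]", i.e. the matches land on the same array element, so JSON_OVERLAPS() is true and the document qualifies.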
Disclaimer: I'm the lead developer of the MySQL X DevAPI Connector for Node.js

Phrase & wildcard queries on Elasticsearch

I am facing some difficulties while trying to create a query that can match only whole phrases, but allows wildcards as well.
Basically I have a field that contains a string (it is actually a list of strings, but for simplicity I am skipping that), which can contain whitespace or be null; let's call it "color".
For example:
{
...
"color": "Dull carmine pink"
...
}
My queries need to be able to do the following:
search for null values (inclusive and exclusive)
search for non null values (inclusive and exclusive)
search for and match only a whole phrase (inclusive and exclusive). For example:
dull carmine pink --> match
carmine pink --> not a match
same as the last, but with wildcards (inclusive and exclusive). For example:
?ull carmine p* --> match to "Dull carmine pink"
dull carmine* -> match to "Dull carmine pink"
etc.
I have been banging my head against the wall for a few days on this, and I have tried almost every type of query I could think of.
I have only managed to make it work partially with a span_near query with the help of this topic.
So basically I can now:
search for a whole phrase with/without wildcards like this:
{
"span_near": {
"clauses": [
{
"span_term": {"color": "dull"}
},
{
"span_term": {"color": "carmine"}
},
{
"span_multi": {"match": {"wildcard": {"color": "p*"}}}
}
],
"slop": 0,
"in_order": true
}
}
search for null values (inclusive and exclusive) by simple must/must_not queries like this:
{
"must" / "must_not": {'exist': {'field': 'color'}}
}
The problem:
I cannot find a way to make an exclusive span query. The only way I can find is this, but it requires both the include and exclude fields, while I am only trying to exclude some fields; all others must be returned. Is there some analog of the "match_all": {} query that can work inside a span_not's include field? Or perhaps an entirely new, more elegant solution?
I found the solution a month ago, but I forgot to post it here.
I do not have an example at hand, but I will try to explain it.
The problem was that the fields I was trying to query were analyzed by Elasticsearch before querying; the analyzer in question was splitting them on spaces, etc. The solution is one of the following two:
1. If you do not use a custom mapping for the index
(meaning you let Elasticsearch dynamically create the appropriate mapping for the field when it was first added).
In this case Elasticsearch automatically creates a subfield of the text field called "keyword". This subfield is of the "keyword" type, which does not process the data in any way prior to querying.
Which means that queries like:
{
"query": {
"bool": {
"must": [ // must_not
{
"match": {
"user.keyword": "Kim Chy"
}
}
]
}
}
}
and
{
"query": {
"bool": {
"must": [ // must_not
{
"wildcard": {
"user.keyword": "Kim*y"
}
}
]
}
}
}
should work as expected.
However, with the default mapping the keyword field will be case-sensitive. For it to be case-insensitive as well, you will need to create a custom mapping that applies a lowercase (or uppercase) normalizer to the keyword field, so that both the indexed value and the query term are normalized prior to matching.
2. If you use a custom mapping
Basically the same as above, except that you have to create the keyword subfield (or a separate field) yourself, and possibly attach a normalizer to it to make it case-insensitive; a sketch of such a mapping follows.
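For illustration, a mapping along these lines (index and field names are placeholders, and this assumes a recent Elasticsearch with normalizer support) creates a case-insensitive color.keyword subfield:
PUT my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "color": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
Queries such as the match and wildcard examples above can then target color.keyword instead of color.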
P.S. As far as I am aware, changing the mapping of an existing field is not possible in Elasticsearch. This means that you will have to create a new index with the appropriate mapping and then reindex your data into it.

How to construct a good Regex that matches all strings containing query for better performance in ElasticSearch?

Suppose I create an index in ElasticSearch by simply calling:
PUT strings
Then I insert documents by calling:
POST strings/string/<some_id>
{
"name": "some_string"
}
Now I want to search for all strings that contain the letter 's', for example:
GET strings/string/_search
{
"query": {
"regexp": {
"name": ".*s.*"
}
}
}
Yes, this gives me what I want. However, I read from here that matching everything like .* is very slow, as is using look-around regular expressions.
The question is: how should I construct the regex to do the same thing, but with better performance?

how to read/interpret json file to define mysql schema

I have been tasked with mapping a JSON file to a MySQL database, and I am trying to define the appropriate schema. A sample of the JSON file is below:
"configurationItems":[
{
"ARN":"",
"availabilityZone":"",
"awsAccountId":"hidden from sight ",
"awsRegion":"",
"configuration":{
"amiLaunchIndex":,
"architecture":"",
"blockDeviceMappings":[
{
"deviceName":"",
"ebs":{
"attachTime":"",
"deleteOnTermination":true,
"status":"attached",
"volumeId":""
}
}
],
"clientToken":"",
"ebsOptimized":,
"hypervisor":"",
"imageId":"",
"instanceId":"",
"instanceType":"",
"kernelId":"aki-",
"keyName":"",
"launchTime":"",
"monitoring":{
"state":""
},
"networkInterfaces":[
{ etc
Am I right in thinking that the way to do this is essentially that wherever there is a bracket / child element there would be a new table? E.g. configurationItems down to awsRegion would be in one table, then configuration through architecture, followed by blockDeviceMappings, etc. If that is the case, then where would clientToken through launchTime belong? Many thanks in advance, folks.
That certainly is one way to do it; it gives the setup a parent-child relational structure.
E.g.
"blockDeviceMappings":[
{
"deviceName":"/dev/sda1",
"ebs":{
"attachTime":"2014-01-06T10:37:40.000Z",
"deleteOnTermination":true,
"status":"attached",
"volumeId":""
}
}
]
There could be more than one device, so it would be a one-to-many relation.
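As a rough sketch (table and column names below are my own invention, not prescribed by the JSON), the nesting could translate to something like this; scalar fields such as clientToken and launchTime are simply columns on the configuration table, since only nested objects and arrays need tables of their own:
CREATE TABLE configuration_item (
    id INT AUTO_INCREMENT PRIMARY KEY,
    arn VARCHAR(255),
    availability_zone VARCHAR(64),
    aws_account_id VARCHAR(64),
    aws_region VARCHAR(64)
);

-- one-to-one child of configuration_item
CREATE TABLE configuration (
    id INT AUTO_INCREMENT PRIMARY KEY,
    configuration_item_id INT NOT NULL,
    ami_launch_index INT,
    architecture VARCHAR(32),
    client_token VARCHAR(64),
    launch_time DATETIME,
    FOREIGN KEY (configuration_item_id) REFERENCES configuration_item (id)
);

-- one-to-many: a configuration can have several block device mappings
CREATE TABLE block_device_mapping (
    id INT AUTO_INCREMENT PRIMARY KEY,
    configuration_id INT NOT NULL,
    device_name VARCHAR(64),
    ebs_attach_time DATETIME,
    ebs_delete_on_termination BOOLEAN,
    ebs_status VARCHAR(32),
    ebs_volume_id VARCHAR(64),
    FOREIGN KEY (configuration_id) REFERENCES configuration (id)
);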

Couchbase Multiple Keys

A simple question, I presume. I have the following data.
I want to search for all rows where the ID is > 2 but < 8 and the Price is > 30
I have used various versions of: startkey=["2", null] or even something like startkey=["2", "30"] just for testing.
It only ever seems to run both conditions on the first row. So if I do: startkey=["2", "30"] then I get back:
{"id":"3","key":["3","30"],"value":null},
{"id":"4","key":["4","30"],"value":null},
{"id":"5","key":["5","20"],"value":null},
{"id":"6","key":["6","60"],"value":null},
{"id":"8","key":["8","60"],"value":null}
Why is row 5 there?
I am starting to come to the view that I need to handle this in code (.NET) and make multiple calls somehow; I can't seem to find anything on this that works.
Note: I have tried, for example, a loop with for (i = 0; i < doc.ID.length; i++) and then emitting with doc.ID[i], but it never returns anything.
Currently I just have
function (doc, meta) {
emit([doc.ID, doc.Price ],null);
}
Essentially I want to have a search where a user supplies 5 input keys. So do I need to make 5 calls, each time taking data from the previous output as the source for the next?
Other references I have looked at include: the manual
Thanks in advance,
Kindest Regards
Robin
This is a common misconception: with a compound array index key, the key is still treated as a string, therefore the index key [2,10] is actually "[2,10]", and the index key [5,20] is actually "[5,20]".
So the reason that startkey=["2", "30"] shows the {"id":"5","key":["5","20"],"value":null} row is that, as a string, it is > the startkey.
Likewise, the Query startkey=[2,10]&endkey=[5,10] returns
{"total_rows":7,"rows":[
{"id":"2","key":[2,20],"value":null},
{"id":"3","key":[3,30],"value":null},
{"id":"4","key":[4,30],"value":null}
]
}
because startkey = "[2,10]" < "[2,20]" and "[4,30]" < "[5,10]" = endkey, but "[5,20]" is not within that string range.
Range Queries with startkey and endkey
A startkey => endkey Range query compares keys using strcmp(); group and group_level are likewise based on the string, with the comma separating string tokens.
A good reference link (Couchbase views work much like the Apache CouchDB views that inspired them):
http://wiki.apache.org/couchdb/View_collation#Collation_Specification
Spatial View/Query
To achieve the result you are trying for, you could also write a Spatial View, which gives you multi-dimensional, numeric-only queries, even though a spatial index might not be what you would initially think of here:
function (doc, meta) {
emit({
type: "Point",
coordinates: [doc.ID, doc.Price]
}, meta.id);
}
The Query would be a Bounding Box Query:
&bbox=2,0,8,30
{"total_rows":0,"rows":[
{"id":"2","bbox":[2,20,2,20],"geometry":{"type":"Point","coordinates":[2,20]},"value":"2"},
{"id":"3","bbox":[3,30,3,30],"geometry":{"type":"Point","coordinates":[3,30]},"value":"3"},
{"id":"4","bbox":[4,30,4,30],"geometry":{"type":"Point","coordinates":[4,30]},"value":"4"},
{"id":"5","bbox":[5,20,5,20],"geometry":{"type":"Point","coordinates":[5,20]},"value":"5"}
]
}
Another Query:
&bbox=2,30,8,30
{"total_rows":0,"rows":[
{"id":"3","bbox":[3,30,3,30],"geometry":{"type":"Point","coordinates":[3,30]},"value":"3"},
{"id":"4","bbox":[4,30,4,30],"geometry":{"type":"Point","coordinates":[4,30]},"value":"4"}
]
}