Can I return document data after a FTS match? - couchbase

Suppose I have this data:
{
"test": "Testing1234",
"false": "Falsify"
}
And then using curl, I write this query:
{"explain": true, "fields": [ "*" ], "highlight": {}, "query": { "query": "Testing"}}
I get a response from couchbase. This includes the document id, as well as a locations object that returns details about where my query matched text in the document, including the parent object. All useful information.
However, I do not receive any additional context. For instance, say I have 100 documents with "test": "TestingXXXX" where XXXX is a random string. My search will not provide me with XXXX. Nor does it provide me any way to read additional fields in the same object (for instance, if I wanted to fetch the "false" property). I will simply get 100 different document IDs to query. Thus, it is technically enough information to obtain all the needed information, however it results in me making 100 different requests based on parsed info from the original response.
Is there any way to return context with FTS matches when using the REST API, without simply querying every document that is matched?

You can get the complete objects by issuing the FTS query from within N1QL using the CURL() function, and then joining that up with the objects themselves.
https://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/curl.html
Your query would have roughly this form:
SELECT *
FROM yourTable
USE KEYS CURL(ftsURL, ftsQuery, ...)
You'll need to wrap the CURL function in some transformation functions to turn the FTS result into an array of ids.
I realize this is quite schematic, since I don't have a full example handy. But work up through these steps:
1. Issue the FTS query through CURL() in N1QL.
2. Transform the FTS results into an array of ids.
3. Embed the request for the array of ids into a SELECT query using USE KEYS.
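The second step above, turning the FTS result into an array of ids, can also be sketched outside N1QL. The Python below is a minimal sketch, assuming the FTS response carries its matches in a top-level `hits` array where each hit has an `id`. The sample response, the helper names and the bucket name `yourTable` are made up for illustration.

```python
import json

def extract_ids(fts_response):
    """Return the document ids from an FTS response, in hit order."""
    return [hit["id"] for hit in fts_response.get("hits", [])]

def build_use_keys_query(bucket, ids):
    """Build a N1QL statement that fetches all matched documents in one request."""
    # json.dumps renders the ids as a quoted JSON array.
    return "SELECT * FROM `{}` USE KEYS {}".format(bucket, json.dumps(ids))

# Hand-written sample that mimics the assumed response shape.
sample_response = {
    "hits": [
        {"id": "doc1", "score": 0.9},
        {"id": "doc2", "score": 0.4},
    ]
}

ids = extract_ids(sample_response)
statement = build_use_keys_query("yourTable", ids)
print(statement)
```

This replaces the 100 follow-up requests with a single query, at the cost of building the key list on the client.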

I figured it out. It's not an issue with the query: the fields were not being indexed. To fix it, I changed the index setting "Store Dynamic Fields" to "True". That said, the highlighting did return a lot of extra detail, and I'm sure it also increases query times quite a bit. The Couchbase documentation seemed to imply it is only used for debugging, so I would like to leave this open in case anyone has further suggestions.
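Once the fields are stored, each hit can carry the requested field values alongside its id. Here is a minimal sketch of reading them in Python, assuming each hit exposes them under a `fields` key. The sample response and the `hit_fields` helper are hypothetical.

```python
def hit_fields(fts_response):
    """Map each hit id to the stored fields returned with it (empty if none)."""
    return {
        hit["id"]: hit.get("fields", {})
        for hit in fts_response.get("hits", [])
    }

# Hand-written sample that mimics the assumed response shape.
sample_response = {
    "hits": [
        {"id": "doc1", "fields": {"test": "Testing1234", "false": "Falsify"}},
        {"id": "doc2"},  # a hit that came back without stored fields
    ]
}

by_id = hit_fields(sample_response)
print(by_id["doc1"]["test"])
```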

How do I best construct complex NiFi routing

I'm a total noob when it comes to NiFi - so please feel free to highlight any stupidity/ignorance.
I'm reading messages from a Kafka topic using NiFi.
Each message contains JSON that contains a field called Function and then a whole bunch of different fields, based on the Function. For example, if Function = "Login", you can expect a username and password field, but if Function = "Pay", you can expect "From", "To" and "Amount" fields.
I need to process each type of Function differently. So, basically, I want to read the message from Kafka, determine the function and then route the message, based on the function to the appropriate set of rules.
It sounds like this should be simple - but for one small complication. I have about 500 different types of Functions. So, I don't want to add a RouteOnAttribute node for each function.
Is there a better way to do this? If this was "real code", I suppose I'd be looking for the difference between a chain of "if" statements and some sort of "switch/case" statement.
You would first use EvaluateJsonPath to extract the function into a flow file attribute, then RouteOnAttribute which would need 500 conditions added to it, and then connect each of those 500 conditions to whatever follow on processing is required. The only other thing you could do is implement a custom processor that handles the 500 conditions internally.
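To illustrate what "handling the conditions internally" could look like, here is a minimal sketch in plain Python rather than NiFi code. A dictionary maps each Function value to a handler, so supporting a new function means adding one entry instead of one more branch. The handler names and return values are invented for illustration; the message fields come from the question.

```python
import json

def handle_login(msg):
    return "login:" + msg["username"]

def handle_pay(msg):
    return "pay:{}->{}:{}".format(msg["From"], msg["To"], msg["Amount"])

# One entry per Function value: 500 functions means 500 entries, not 500 branches.
HANDLERS = {
    "Login": handle_login,
    "Pay": handle_pay,
}

def route(raw_message):
    """Parse the JSON message and dispatch on its Function field."""
    msg = json.loads(raw_message)
    handler = HANDLERS.get(msg.get("Function"))
    if handler is None:
        return "unmatched"
    return handler(msg)

print(route('{"Function": "Pay", "From": "a", "To": "b", "Amount": 5}'))
```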

ElasticSearch multiple exact search on field returns no results

I'm struggling with this, which I feel should work but maybe I'm doing something stupid. This search:
{
"query":
{
"bool":
{
"must":[
{"match":{"Element.sourceSystem.name":"Source1 Source2"}}
]
}
}
}
returns data for both Source1 and Source2. After adding a terms query, as shown below, I would expect to get a subset of the first search with just the Source1 documents. Instead nothing is returned, whether the terms query is combined with the first query or run on its own.
{
"query":
{
"bool":
{
"must":[
{"match":{"Element.sourceSystem.name":"Source1 Source2"}},
{"terms":{"Element.sourceSystem.name":["Source1"]}}
]
}
}
}
I realise this is hard without seeing the documents, but suffice it to say that "Element.sourceSystem.name" exists and is available as the first search works fine - all input gratefully received.
There are some things that are handled differently in match queries than in terms queries.
First of all, a detour to analyzers:
I will assume you are using Elasticsearch's standard analyzer, which consists of a standard tokenizer and some token filters. The standard tokenizer splits your text into terms on spaces, punctuation marks and some other special characters. Details can be found in the Elasticsearch documentation, so for now let's just say 'each word will be a term'.
The second, very important part of the analyzer is the lowercase filter. It will transform terms into lowercase. This means, later on, searching for Source1 and source1 should yield the same results.
So a short example:
Input : "This is my input text in English." will be analyzed and end up with the following terms: "this", "is", "my", "input", "text", "in", "english".
All of this happens when you index a document into a text field, for example. I assume Element.sourceSystem.name is a field of this type, since your normal match query seems to work.
Now, when you issue a match query with "Source1 Source2", the analysis will also happen and transform it into tokens source1 and source2. Internally it will then create 2 term queries in a boolean OR. So either source1 or source2 must match to be a result of your query.
By the way, the match query supports a minimum_should_match property. You could specify, how many terms of your match query need to match.
Here is the catch with the terms query: it does not analyze the text you provide. It is usually meant to be used on fields of type keyword. Keyword fields are also not analyzed (for further information, please read the documentation on mapping types, which is actually pretty important). So what does this mean?
If I take my example from above, my index would contain "this", "is", "my", "input", "text", "in", "english".
A match query with English will match, because it will be analyzed to english
A term/s query with English will never match, because there is no term English in my index. It is case sensitive.
I am fairly sure that if you used source1 in your terms query, it would match something. However, I highly doubt that your query is the way to go for your use case. Try using normal match queries when querying text fields and (in general, though not always applicable) only use terms queries on keyword fields.
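The difference between the two query types can be reproduced with a toy model. The sketch below is plain Python, not Elasticsearch: `analyze` is a simplified stand-in for the standard analyzer (split into words, lowercase), and the three one-word documents are invented for illustration.

```python
import re

def analyze(text):
    """Toy stand-in for the standard analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def build_index(docs):
    """Inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def match_query(index, text):
    """Analyze the query text, then OR the resulting terms together."""
    result = set()
    for term in analyze(text):
        result |= index.get(term, set())
    return result

def terms_query(index, terms):
    """Look the terms up exactly as given, with no analysis."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

docs = {1: "Source1", 2: "Source2", 3: "Source3"}
index = build_index(docs)

print(match_query(index, "Source1 Source2"))  # documents 1 and 2
print(terms_query(index, ["Source1"]))        # empty: the indexed term is "source1"
print(terms_query(index, ["source1"]))        # document 1
```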

Schemaless Support for Elastic Search Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
"givenName": "Joe",
"username": "joe",
"email": "joe@mailinator.com",
"customData": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to just not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with this approach. However, after running multiple tests we haven’t been able to get that to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
"givenName": "Joe",
"username": "joe",
"email": "joe@mailinator.com",
"customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
"favoriteColor": "red",
"someObject": {
"someKey": "someValue"
}
}
}
This means ES would create a new mapping for each ‘random’ field, and we would use a phrase multi-match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
"query": {
"multi_match": {
"query" : "red",
"type" : "phrase",
"fields" : ["customData_*.favoriteColor"]
}
}
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
For your requirements, you should take a look at SIREn Solutions Elasticsearch plugin which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping are stored as the same Lucene field in the Lucene index (an Elasticsearch shard). Each distinct Lucene field has its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized to store documents that share a field in a compressed way. Using a mapping with a different field for each document prevents Lucene from applying that optimization.
You should use Elasticsearch Nested Document to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a document block.
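One way to apply the nested-document idea to free-form data is to store every custom field as a key/value entry in an array, so the mapping only ever sees a fixed set of field names. This is an assumption about how the data could be modeled, not something the answers above spell out. In the Python sketch below, the entry field names (`key`, `string_value`, `number_value`, `bool_value`) and the `flatten` helper are hypothetical.

```python
def flatten(custom_data, prefix=""):
    """Turn a free-form dict into a list of {key, <typed value>} entries."""
    entries = []
    for name, value in custom_data.items():
        path = prefix + name
        if isinstance(value, dict):
            # Recurse into nested objects, joining names with a dot.
            entries.extend(flatten(value, path + "."))
        elif isinstance(value, bool):
            # Check bool before numbers: bool is a subclass of int in Python.
            entries.append({"key": path, "bool_value": value})
        elif isinstance(value, (int, float)):
            entries.append({"key": path, "number_value": value})
        else:
            # Anything else (strings, lists, None) is stored as text in this sketch.
            entries.append({"key": path, "string_value": str(value)})
    return entries

custom_data = {
    "favoriteColor": "red",
    "someObject": {"someKey": "someValue"},
}

entries = flatten(custom_data)
print(entries)
```

Because a value's type decides which field it lands in, "favoriteColor": 10 in one document and "favoriteColor": "red" in another no longer compete for the same mapping.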

Store multiple authors in to couchbase database

I am a newbie to Couchbase Server. I want to store 10 author names in Couchbase, one after another. Can someone tell me whether the structure should be a single "author" document holding multiple values, like
{ "id": 1, "name": "Author 1" }, { "id": 2, "name": "Author 2" }
or whether I should store Author 1 in one document and Author 2 in another?
If so, how can I increment the id automatically before the insert command?
You can store all authors in a single document, with one entry per author:
{
"doctype": "Authors",
"AuthorNames": [
{
"id": 1,
"Name": "author1"
},
{
"id": 2,
"Name": "author2"
}
]
}
If you want the ID to increase, one option is to store one author name at a time in a new document, but then the ID will be randomly generated and will not be in incremental order.
In Couchbase, think more about how your application will be using the data than about how you want to store it. For example, will your application need to get all of the 10 authors all of the time? If so, then one document might be worthwhile. Perhaps your application only ever needs to read/write one of the authors at a time. Then you might want to put each in its own document, but with an object key pattern that lets you get the object really fast. Objects that are used often are kept in the managed cache; other objects that are not used often may fall out of the managed cache, and that is OK.
The other factor is what your reads to writes ratio is on this data.
So like I said, it depends on how your application will be reading and writing your data. Use this as the guidance for how your data should be stored.
The single JSON document is pretty straightforward. The more advanced schema design, where each author is in its own document and you access them via object key, might be a bit more complicated, but ultimately faster and more scalable depending on what I already pointed out. I will lay out an example schema and some possibilities.
For the authors, I might create each author JSON document with an object key like this:
authors::ID
Where ID is a value I keep in a special incrementer object that I will call authors::incrementer. Think of that object as a key-value pair holding only an integer that happens to be the upper bound of an array. Couchbase SDKs include a special function to increment just such an integer object. With this, my application can put together that object key very quickly. If I want to go after the 5th author, I do a read by object key for "authors::5". If I need to get 10, I do a parallelized BulkGet and get authors::1 through authors::10. If I want to get all the authors, I get the incrementer object, read that integer and then do a parallelized bulk get. This way I can get them in order or in whatever order I feel like, and I am accessing them by object key, which is VERY fast in Couchbase.
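The key pattern described above can be sketched as follows. This is plain Python with an in-memory `FakeBucket` class standing in for a real bucket; it is not the Couchbase SDK, and its method names are assumptions chosen for illustration. Only the `authors::ID` and `authors::incrementer` keys come from the answer.

```python
class FakeBucket:
    """In-memory stand-in for a key-value bucket; not the Couchbase SDK."""

    def __init__(self):
        self._data = {}

    def increment(self, key, delta=1, initial=1):
        """Create the counter at `initial`, or add `delta` to it; return the new value."""
        if key in self._data:
            self._data[key] += delta
        else:
            self._data[key] = initial
        return self._data[key]

    def upsert(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def get_multi(self, keys):
        return {key: self._data[key] for key in keys}

COUNTER_KEY = "authors::incrementer"

def add_author(bucket, name):
    """Reserve the next id from the counter and store the author under authors::ID."""
    new_id = bucket.increment(COUNTER_KEY)
    key = "authors::{}".format(new_id)
    bucket.upsert(key, {"id": new_id, "name": name})
    return key

def all_authors(bucket):
    """Read the counter, build every key up to it, and fetch them in one bulk call."""
    upper = bucket.get(COUNTER_KEY)
    keys = ["authors::{}".format(i) for i in range(1, upper + 1)]
    return bucket.get_multi(keys)

bucket = FakeBucket()
for n in range(1, 4):
    add_author(bucket, "Author {}".format(n))

print(bucket.get("authors::2"))
print(sorted(all_authors(bucket)))
```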
All this being said, I could use a view to query this data or the upcoming "SQL for Documents" in Couchbase 4.0 or I can mix and match when I query and when I get objects by their key. Key access will ALWAYS be faster. It is the difference between asking a question then going and getting the object and simply knowing the answer and getting it immediately.

JSON oData.metadata

I have questions about JSON returning from the server using the Microsoft oData API.
Cannot figure it out.
Query1:
http://localhost:63717/odata/City(1)
Fiddler returns the raw data below.
Everything is in its own brackets.
{
"odata.metadata":"http://localhost:63717/odata/$metadata#City/#Element","CityID":1,"CityName":"Minnetonka","CityAddr1":null,"CityAddr2":null,"CityCity":null,"CityState":null,"CityZip":null,"CityPhone":null,"CityFAX":null,"CityExtent":"-93.53,44.88,-93.39,44.93","CityHeaderImage":null
}
Query2:
http://localhost:63717/odata/City?$filter=CityName eq 'Minnetonka'
Fiddler returns the raw data below.
Data is in two sets of bracketed data
{
"odata.metadata":"http://localhost:63717/odata/$metadata#City","value":[
{
"CityID":1,"CityName":"Minnetonka","CityAddr1":null,"CityAddr2":null,"CityCity":null,"CityState":null,"CityZip":null,"CityPhone":null,"CityFAX":null,"CityExtent":"-93.53,44.88,-93.39,44.93","CityHeaderImage":null
}
]
}
What do I have to do to format my JSON coming back for $filters in the oData request?
That odata.metadata is killing me in Query2.
Please explain what I am doing wrong.
In the first example, you have just one City element (denoted by City(1) in the request and #City/#Element in the result path).
In the second example, the value property in result is showing an array of City types (a listing of one or more objects). [ ... ] denotes an array in JavaScript. For a $filter type query, this is what I would expect. You can also see that the response path is less specific (#City instead of #City/#Element).
The path shown in the odata.metadata property value describes the structure of the element being returned, as I showed two examples above. The format of the return data will change depending on how you request it.
If you're having trouble parsing the JSON returned, consider using a library to do the heavy lifting for you. For example:
datajs
JayData
Breeze.js
You are not doing anything wrong; the two formats actually represent two different forms of result.
In the first, you are requesting a single item because you are specifying the key for the entity.
In the second you are potentially asking for a list of entities. The odata.metadata is separate in this response; otherwise it would be repeated for every item returned, which would be a waste in terms of content length.
The difference comes from the way that you are addressing the entity.
With //localhost:63717/odata/City(1) you are addressing one entity ("/entityset/key"). You will always get back one City (if one exists). There is no need for it to return an array because it will never return more than one.
With //localhost:63717/odata/City you are addressing a collection of entities ("/entityset"). 0 to n City entities could be returned, hence the need for a collection.
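If client code has to cope with both shapes, one option is to normalize them into a list before further processing. The Python below is a minimal sketch under that assumption. The `entities` helper is hypothetical, the sample payloads are trimmed versions of the two responses in the question, and an entity that itself has a list property named `value` would be misread by this check.

```python
import json

def entities(payload):
    """Return the entities in an OData JSON response as a list, whatever its shape."""
    if isinstance(payload.get("value"), list):
        # Collection response: the entities sit in the "value" array.
        return payload["value"]
    # Single-entity response: drop the metadata annotation and wrap the rest.
    entity = {k: v for k, v in payload.items() if k != "odata.metadata"}
    return [entity]

single = json.loads(
    '{"odata.metadata": "http://localhost:63717/odata/$metadata#City/#Element",'
    ' "CityID": 1, "CityName": "Minnetonka"}'
)
collection = json.loads(
    '{"odata.metadata": "http://localhost:63717/odata/$metadata#City",'
    ' "value": [{"CityID": 1, "CityName": "Minnetonka"}]}'
)

print(entities(single))
print(entities(collection))
```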