Phrase & wildcard queries on Elasticsearch

I am facing some difficulties while trying to create a query that matches only whole phrases but also allows wildcards.
Basically I have a field that contains a string (it is actually a list of strings, but for simplicity I am skipping that), which can contain whitespace or be null; let's call it "color".
For example:
{
  ...
  "color": "Dull carmine pink"
  ...
}
My queries need to be able to do the following:
search for null values (inclusive and exclusive)
search for non null values (inclusive and exclusive)
search for and match only a whole phrase (inclusive and exclusive). For example:
dull carmine pink --> match
carmine pink --> not a match
same as the last, but with wildcards (inclusive and exclusive). For example:
?ull carmine p* --> match to "Dull carmine pink"
dull carmine* --> match to "Dull carmine pink"
etc.
I have been bumping my head against the wall for a few days with this and I have tried almost every type of query I could think of.
I have only managed to make it work partially with a span_near query with the help of this topic.
So basically I can now:
search for a whole phrase with/without wildcards like this:
{
  "span_near": {
    "clauses": [
      {
        "span_term": { "color": "dull" }
      },
      {
        "span_term": { "color": "carmine" }
      },
      {
        "span_multi": { "match": { "wildcard": { "color": "p*" } } }
      }
    ],
    "slop": 0,
    "in_order": true
  }
}
search for null values (inclusive and exclusive) by simple must/must_not queries like this:
{
  "must" / "must_not": { "exists": { "field": "color" } }
}
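Spelled out in full, the null-value check looks roughly like this (just the "color" field from the example above wrapped in a bool query):
{
  "query": {
    "bool": {
      "must_not": [
        { "exists": { "field": "color" } }
      ]
    }
  }
}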
The problem:
I cannot find a way to make an exclusive span query. The only way I can find is this. But it requires both the include and exclude fields, and I am only trying to exclude some matches; all others must be returned. Is there some analog of the "match_all": {} query that can work inside a span_not's include field? Or perhaps an entirely new, more elegant solution?

I found the solution a month ago, but I forgot to post it here.
I do not have an example at hand, but I will try to explain it.
The problem was that the fields I was trying to query were analyzed by Elasticsearch before querying. The analyzer in question was splitting them on whitespace, among other things. The solution to this problem is one of the following two:
1. If you do not use a custom mapping for the index
(meaning you let Elasticsearch dynamically create the appropriate mapping for the field when you added it).
In this case Elasticsearch automatically creates a subfield of the text field called "keyword". This subfield is of the "keyword" type, which stores the value as a single term and does not process the data in any way prior to querying.
Which means that queries like:
{
  "query": {
    "bool": {
      "must": [ // must_not
        {
          "match": {
            "user.keyword": "Kim Chy"
          }
        }
      ]
    }
  }
}
and
{
  "query": {
    "bool": {
      "must": [ // must_not
        {
          "wildcard": {
            "user.keyword": "Kim*y"
          }
        }
      ]
    }
  }
}
should work as expected.
However, with the default mapping the keyword field will most likely be case-sensitive. In order for it to be case-insensitive as well, you will need to create a custom mapping that applies a lowercase (or uppercase) normalizer to the keyword field, so that both the indexed value and the query term are normalized prior to matching.
2. If you use a custom mapping
Basically the same as above, however you will have to manually create a new subfield (or field) of the "keyword" type (and possibly attach a normalizer to it in order for it to be case-insensitive); a rough sketch follows below.
P.S. As far as I am aware, changing the mapping of an existing field is not possible in Elasticsearch. This means that you will have to create a new index with the appropriate mapping and then reindex your data into the new index.
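A minimal sketch of such a mapping plus the follow-up reindex, assuming a recent Elasticsearch version; the index names ("colors_v2", "colors_v1"), the normalizer name and the "color" field are placeholders, not something from the question:
PUT colors_v2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "color": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "colors_v1" },
  "dest": { "index": "colors_v2" }
}
With this in place, term and wildcard queries against "color.keyword" should match case-insensitively.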

Related

Format for storing json in Amazon DynamoDB

I've got a JSON file that looks like this:
{
  "alliance": {
    "name_part_1": [
      "Ab",
      "Aen",
      "Zancl"
    ],
    "name_part_2": [
      "aca",
      "acia",
      "ythrae",
      "ytos"
    ],
    "name_part_3": [
      "Alliance",
      "Bond"
    ]
  }
}
I want to store it in DynamoDB.
The thing is that I want a generator that would take random elements from fields like name_part_1, name_part_2 and others (the number of name_part_x fields is unlimited, and the overall number of items in each part might be several hundred) and join them to create a complete word, like
name_part_1[1] + name_part_2[10] + name_part[3]
My question is: what format should I use to do that effectively? Or should NoSQL not be used for that? Should I refactor the JSON into something like
{
  "name": "alliance",
  "parts": [ "name_part_1", "name_part_2", "name_part_3" ],
  "values": [
    { "name_part_1": [ "Ab ... ] }, { "name_part_2": [ "aca" ... ] }
  ]
}
This is a perfect use case for DynamoDB.
You can structure it like this:
NameParts (Table)
namepart (partition key)
range (sort key)
namepart: name_part_1, range: 0, value: "Ab"
This way each name_part has its own range and stays scalable; you can extend it to thousands or even millions of items.
You can then do a BatchGetItem from the SDK of your choice and join those values, along the lines of the sketch below.
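A rough sketch of the low-level BatchGetItem request body under that table design (the particular keys requested here are arbitrary examples):
{
  "RequestItems": {
    "NameParts": {
      "Keys": [
        { "namepart": { "S": "name_part_1" }, "range": { "N": "1" } },
        { "namepart": { "S": "name_part_2" }, "range": { "N": "10" } },
        { "namepart": { "S": "name_part_3" }, "range": { "N": "2" } }
      ]
    }
  }
}
Each returned item carries its value attribute, and the client concatenates them into the final word.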
REST API reference:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchGetItem.html
Hope it helps.
You can just put the whole document as it is into DynamoDB and then use a document path to access the elements you want.
Document Paths
In an expression, you use a document path to tell DynamoDB where to
find an attribute. For a top-level attribute, the document path is
simply the attribute name. For a nested attribute, you construct the
document path using dereference operators.
The following are some examples of document paths. (Refer to the item
shown in Specifying Item Attributes.)
A top-level scalar attribute: ProductDescription
A top-level list attribute (this will return the entire list, not just some of the elements): RelatedItems
The third element from the RelatedItems list (remember that list elements are zero-based): RelatedItems[2]
The front-view picture of the product: Pictures.FrontView
All of the five-star reviews: ProductReviews.FiveStar
The first of the five-star reviews: ProductReviews.FiveStar[0]
Note: The maximum depth for a document path is 32. Therefore, the number of dereference operators in a path cannot exceed this limit.
Note that each document requires a unique Partition Key.
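For example, a low-level GetItem request could pull just a couple of the nested arrays with a ProjectionExpression; the table name "Names" and the "id" key used here are assumptions about how the document would be stored, not part of the question:
{
  "TableName": "Names",
  "Key": { "id": { "S": "alliance" } },
  "ProjectionExpression": "alliance.name_part_1, alliance.name_part_2[10]"
}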

How to construct a good Regex that matches all strings containing query for better performance in ElasticSearch?

Suppose I create an index in ElasticSearch by simply calling:
PUT strings
Then I insert documents by calling:
POST strings/string/<some_id>
{
  "name": "some_string"
}
Now I want to search for all strings that contain the letter 's', for example:
GET strings/string/_search
{
  "query": {
    "regexp": {
      "name": ".*s.*"
    }
  }
}
Yes, this gives me what I want. However, I read from here that matching everything like .* is very slow, as is using look-around regular expressions.
Question is, how should I construct the regex in order to do the same thing but with a better performance?

Possible to chain results in N1ql?

I'm currently trying to do a bit of complex N1QL for a project I'm working on. Theoretically I could do all of this processing in multiple N1QL calls and by parsing the results each time, but if possible I'd like for this to be contained in one call.
What I would like to do is:
filter all documents that contain a "dataSync.test.id" field with more than 1 id
Read back all other ids in that list
Use that list to get other documents containing those ids
Get the "dataSync.test._channels" field for those documents (optionally a filter by docType might help parsing)
This would probably return a list of "dataSync.test._channels"
Is this possible in N1QL? It appears like it might be but I can't get the syntax right.
My data structures look a little like
{
  "dataSync": {
    "test": {
      "_channels": [
        "RP"
      ],
      "id": [
        "dataSync_user_1015",
        "dataSync_user_1010",
        "dataSync_user_1005"
      ],
      "_lastUpdatedBy": "TEST"
    }
  },
  ...
}
{
  "dataSync": {
    "test": {
      "_channels": [
        "RSD"
      ],
      "id": [
        "dataSync_user_1010"
      ],
      "_lastUpdatedBy": "TEST"
    }
  },
  ...
}
Yes, I think you can do all of these.
The initial set of IDs (with filtering) can be retrieved as a subquery, and then you can get the subsequent documents via joins.
SELECT fulldoc
FROM (SELECT META().id AS dockey FROM doc WHERE a = 1) AS mydoc
INNER JOIN doc fulldoc ON KEYS mydoc.dockey;
There are optimizations that can be done here. Try the sequencing first to ensure you get the job done; a sketch adapted to your documents follows.
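A rough, untested sketch against the documents shown above, assuming they live in a bucket named `default` and that the values in dataSync.test.id are the document keys of the related documents:
SELECT other.dataSync.test._channels
FROM `default` AS multi
UNNEST multi.dataSync.test.id AS linkedId
JOIN `default` AS other ON KEYS linkedId
WHERE ARRAY_LENGTH(multi.dataSync.test.id) > 1;
An additional filter on docType could be added to the WHERE clause if that helps parsing the results.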

How do I do a partial match in Elasticsearch?

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
  bool: {
    must: {
      match: { text: 'google' }
    }
  }
}
But this only matches if the whole text is 'google' (case-insensitive, so it also matches Google or GooGlE, etc.). How do I match the 'google' inside of another string?
The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any character (except a newline), you can use the .* pattern:
match: { text: '.*google.*'}
In ES6+, use regexp instead of match:
"query": {
  "regexp": { "text": ".*google.*" }
}
One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in ElasticSearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
However, if you do not plan to match any complicated patterns and need no word boundary checking, regex search for a mere substring is better performed with a mere wildcard search:
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*google*",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind the only pair of special characters in wildcard patterns:
?, which matches any single character
*, which can match zero or more characters, including an empty one
Use a wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'
For both partial and full text matching, the following worked:
"query": {
  "query_string": {
    "query": "*searchText*",
    "fields": [
      "fieldName"
    ]
  }
}
I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
"regexp": { "text": ".*google.*"}
}
For partial matching you can either use prefix or match_phrase_prefix.
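For example, a minimal match_phrase_prefix sketch (the field name "text" and the prefix value are only placeholders):
{
  "query": {
    "match_phrase_prefix": {
      "text": "drive.goo"
    }
  }
}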
For a more generic solution you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working because it is trying to compare it to the full "drive.google.com".
If instead you indexed your documents using the simple analyzer, it would split it up into "http", "drive", "google", and "com". This will allow you to match any one of those terms on its own; a sketch follows below.
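A rough sketch of such a mapping, assuming a recent Elasticsearch version; the index name "links" and the field name "text" are placeholders:
PUT links
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}
After indexing http://drive.google.com into that field, a plain match query for 'google' should hit the document.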
Using the Node.js client, where tag_name is the field name and value is the incoming search value:
const { body } = await elasticWrapper.client.search({
  index: ElasticIndexs.Tags,
  body: {
    query: {
      wildcard: {
        tag_name: {
          value: `*${value}*`,
          boost: 1.0,
          rewrite: 'constant_score',
        },
      },
    },
  },
});
You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
  query: `*${keyword}*`,
  fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.

reactivemongo - merging two BSONDocuments

I am looking for the most efficient and easy way to merge two BSON documents. In case of collisions I already have handlers; for example, if both documents include an integer I will sum them, likewise if a string, and if an array I will append the elements of the other one, etc.
However, due to BSONDocument's immutable nature, it is almost impossible to do anything with it. What would be the easiest and fastest way to do the merging?
I need to merge the following for example:
{
  "2013": {
    "09": {
      value: 23
    }
  }
}
{
  "2013": {
    "09": {
      value: 13
    },
    "08": {
      value: 1
    }
  }
}
And the final document would be:
{
  "2013": {
    "09": {
      value: 36
    },
    "08": {
      value: 1
    }
  }
}
There is a BSONDocument.add method, however it doesn't check for uniqueness, which means I would end up with two "2013" root keys in the result, etc.
Thank you!
If I understand your inquiry, you are looking to aggregate field data via a composite id. MongoDB has a fairly slick aggregation framework. Part of that framework is the $group pipeline stage. This will allow you to specify an _id to group by, which could be defined as a field or a document as in your example, as well as perform aggregation using accumulators such as $sum.
Here is a link to the manual for the operators you will probably need to use.
http://docs.mongodb.org/manual/reference/operator/aggregation/group/
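A minimal sketch of the idea, assuming the data were reshaped to carry explicit year, month and value fields; the collection name "stats" and those field names are assumptions, not your current nested structure:
db.stats.aggregate([
  {
    "$group": {
      "_id": { "year": "$year", "month": "$month" },
      "total": { "$sum": "$value" }
    }
  }
])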
Also, please remove the "merge" tag from your original inquiry to reduce confusion. Many MongoDB drivers include a Merge function as part of the BsonDocument representation as a way to consolidate two BsonDocuments into a single BsonDocument linearly or via element overwrites and it has no relation to aggregation.
Hope this helps.
ndh