How do I do a partial match in Elasticsearch? - json

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
bool : {
must: {
match: { text: 'google'}
}
}
}
But this only matches if the whole text is 'google' (case insensitive, so it also matches Google or GooGlE etc). How do I match for the 'google' inside of another string?

The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any character (but a newline), you can use .* pattern:
match: { text: '.*google.*'}
^^ ^^
In ES6+, use regexp insted of match:
"query": {
"regexp": { "text": ".*google.*"}
}
One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in ElasticSearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
However, if you do not plan to match any complicated patterns and need no word boundary checking, regex search for a mere substring is better performed with a mere wildcard search:
{
"query": {
"wildcard": {
"text": {
"value": "*google*",
"boost": 1.0,
"rewrite": "constant_score"
}
}
}
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind the only pair of special characters in wildcard patterns:
?, which matches any single character
*, which can match zero or more characters, including an empty one

use wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'

For both partial and full text matching ,the following worked
"query" : {
"query_string" : {
"query" : "*searchText*",
"fields" : [
"fieldName"
]
}

I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
"regexp": { "text": ".*google.*"}
}

For partial matching you can either use prefix or match_phrase_prefix.

For a more generic solution you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working because it is trying to compare it to the full "drive.google.com".
If instead you indexed your documents using the simple analyzer it would split it up into "http", "drive", "google", and "com". This will allow you to match anyone of those terms on their own.

using node.js client
tag_name is the field name, value is the incoming search value.
const { body } = await elasticWrapper.client.search({
index: ElasticIndexs.Tags,
body: {
query: {
wildcard: {
tag_name: {
value: `*${value}*`,
boost: 1.0,
rewrite: 'constant_score',
},
},
},
},
});

You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
query: `*${keyword}*`,
fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.

Related

Phrase & wildcard queries on Elasticsearch

I am facing some difficulties while trying to create a query that can match only whole phrases, but allows wildcards as well.
Basically I have a filed that contains a string (it is actually a list of strings, but for simplicity I am skipping that), which can contain white spaces or be null, lets call it "color".
For example:
{
...
"color": "Dull carmine pink"
...
}
My queries need to be able to do the following:
search for null values (inclusive and exclusive)
search for non null values (inclusive and exclusive)
search for and match only a whole phrase (inclusive and exclusive). For example:
dull carmine pink --> match
carmine pink --> not a match
same as the last, but with wildcards (inclusive and exclusive). For example:
?ull carmine p* --> match to "Dull carmine pink"
dull carmine* -> match to "Dull carmine pink"
etc.
I have been bumping my head against the wall for a few days with this and I have tried almost every type of query I could think of.
I have only managed to make it work partially with a span_near query with the help of this topic.
So basically I can now:
search for a whole phrase with/without wildcards like this:
{
"span_near": {
"clauses": [
{
"span_term": {"color": "dull"}
},
{
"span_term": {"color": "carmine"}
},
{
"span_multi": {"match": {"wildcard": {"color": "p*"}}}
}
],
"slop": 0,
"in_order": true
}
}
search for null values (inclusive and exclusive) by simple must/must_not queries like this:
{
"must" / "must_not": {'exist': {'field': 'color'}}
}
The problem:
I cannot find a way to make an exclusive span query. The only way I can find is this. But it requires both include & exclude fields, and I am only trying to exclude some fields, all others must be returned. Is there some analog of the "match_all":{} query that can work inside of an span_not's include field? Or perhaps an entire new, more elegant solution?
I found the solution a month ago, but I forgot to post it here.
I do not have an example at hand, but I will try to explain it.
The problem was that the fields I was trying to query were analyzed by elasticsearch before querying. The analyzer in question was dividing them by spaces etc. The solution to this problem is one of the two:
1. If you do not use a custom mapping for the index.
(Meaning if you let elasticsearch to dynamically create the appropriate mapping for your field when you were adding it).
In this case elastic search automatically creates a subfield of the text field called "keyword". This subfield uses the "keyword" analyzer which does not process the data in any way prior to querying.
Which means that queries like:
{
"query": {
"bool": {
"must": [ // must_not
{
"match": {
"user.keyword": "Kim Chy"
}
}
]
}
}
}
and
{
"query": {
"bool": {
"must": [ // must_not
{
"wildcard": {
"user.keyword": "Kim*y"
}
}
]
}
}
}
should work as expected.
However with the default mapping, the keyword field will most likely be case-sensitive. In order for it to be case-insensitive as well, you will need to create a custom mapping, that applies a lower-case (or upper-case) normalizer to the query and keyword field prior to matching.
2. If you use a custom mapping
Basically the same as above, however you will have to create a new subfield (or field) manually that uses the keyword analyzer (and possibly a normalizer in order for it to be case-insensitive).
P.S. As far as I am aware changing of a mapping is no longer possible in elasticsearch. This means that you will have to create a new index with the appropriate mapping, and then reindex your data to the new index.

What's the difference between the 'originalText' and 'word' keys in a token?

When using CoreNLPParser from NLTK with CoreNLP Server, the resulting tokens contain both an 'originalText' key and a 'word' key.
What's the difference between the two? Is there any documentation about them?
I've only found this issue, which mentioned the origintalText key, but it doesn't answer my questions.
from nltk.parse.corenlp import CoreNLPParser
corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
text = u'我家没有电脑。'
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
print(result)
prints
{
"sentences":[
{
"index":0,
"tokens":[
{
"index":1,
"word":"我家",
"originalText":"我家",
"characterOffsetBegin":0,
"characterOffsetEnd":2
},
{
"index":2,
"word":"没有",
"originalText":"没有",
"characterOffsetBegin":2,
"characterOffsetEnd":4
},
{
"index":3,
"word":"电脑",
"originalText":"电脑",
"characterOffsetBegin":4,
"characterOffsetEnd":6
},
{
"index":4,
"word":"。",
"originalText":"。",
"characterOffsetBegin":6,
"characterOffsetEnd":7
}
]
}
]
}
Update:
It seems the Token implements HasWord and HasOriginalText
A word is transformed a little bit to make it, e.g., possible to print it in an S-Expression (i.e., a parse tree). So, parentheses and other braces become tokens like -LRB- (left round brace). In addition, quotes are normalized to be backticks (``) and forward ticks ('') and some other little things.
originalText, by contrast, is the literal original text of the token that can be used to reconstruct the original sentence.

How to construct a good Regex that matches all strings containing query for better performance in ElasticSearch?

Suppose I create an index in ElasticSearch by simply calling:
PUT strings
Then I insert documents by calling:
POST strings/string/<some_id>
{
"name": "some_string"
}
Now I want to search for all strings that contain the letter 's', for example:
GET strings/string
{
"query": {
"regexp": {
"name": ".*s.*"
}
}
}
Yes, this gives me what I want. However, I read from here that Matching everything like .* is very slow as well as using look-around regular expressions.
Question is, how should I construct the regex in order to do the same thing but with a better performance?

Elastic search: negative boost for term query

At the moment I'm using queries like the following with positive boosting of a term.
"query": {
"bool" : {
"must" : {
"term" : { "title" : {value :"word", boost: 2.0}}
}
}
}
This type of query is described here.
I would like to know if it is possible to use negative boosting of term just like the above, but instead of 2.0 a -2.0. So like this:
"query": {
"bool" : {
"must" : {
"term" : { "title" : {value :"word", boost: -2.0}}
}
}
}
I couldn't find any documentation on it. It only tells the default value for boost is 1.0. And all examples use positive boosting. There is however some kind of negative boost ( described here), but that is boosting queries instead of terms.
The best way is to try it out ;-) If you'd get a search exception or something you'll know negative boosts are not allowed. But if you try you'll see that it's not the case and negative boosts are perfectly valid and impact the scores.
As you saw, you can also use boosting_query with a negative_boost which will demote results that match a given query (e.g. those documents having "title": "word"). Watch out, though, than when using this query the negative_boost value must be a "positive" number as it will already be negated within the source code.

Difference Between Two Mongo Queries

what is the difference between two mongo queries.
db.test.find({"field" : "Value"})
db.test.find({field : "Value"})
mongo shell accepts both.
There is no difference in your example.
The problem happens when your field names contain characters which cannot be a part of an identifier in Javascript (because the query engine is run in a javascript repl/shell)
For example user-name because there is a hyphen in it.
Then you would have to query like db.test.find({"user-name" : "Value"})
For the mongo shell there is no actual difference, but in some other language cases it does matter.
The actual case here is presenting what is valid JSON, and with myself as a given example, I try to do this in responses on this forum and others as JSON is a data format that can easily be "parsed" into native data structures, where alternate "JavaScript" notation may not be translated so easily.
There are certain cases where the quoting is required, as in:
db.test.find({ "field-value": 1 })
or:
db.test.find({ "field.value": 1 })
As the values would otherwise be "invalid JavaScript".
But the real point here is adhering to the JSON form.
You can understand with example: suppose that you have test collection with two records
{
'_id': ObjectId("5370a826fc55bb23128b4568"),
'name': 'nanhe'
}
{
'_id': ObjectId("5370a75bfc55bb23128b4567"),
'your name': 'nanhe'
}
db.test.find({'your name':'nanhe'});
{ "_id" : ObjectId("5370a75bfc55bb23128b4567"), "your name" : "nanhe" }
db.test.find({your name:'nanhe'});
SyntaxError: Unexpected identifier