How to get an exact phrase match using BleveSearch?

I am searching for synonyms of a particular phrase in a dataset. I have two JSON files storing the data, consisting of synonyms for yes and no. If I query for "not interested", it returns both the yes and the no phrases/synonyms; the expected result is just the no phrases/synonyms.
en-gen-yes.json
{
  "tag": "en-gen-yes",
  "phrases": [
    "yes",
    "yeah",
    "sure",
    "suits me",
    "interested"
  ]
}
en-gen-no.json
{
  "tag": "en-gen-no",
  "phrases": [
    "no",
    "nope",
    "not sure",
    "does not suits me",
    "not interested"
  ]
}
query code
query := bleve.NewMatchPhraseQuery("not interested")
req := bleve.NewSearchRequest(query)
req.Fields = []string{"phrases"}
searchResults, err := paraphraseIndex.Search(req)
if err != nil {
    log.Fatal(err)
}
if searchResults.Hits.Len() == 0 {
    fmt.Println("No matches found")
} else {
    for _, hit := range searchResults.Hits {
        fmt.Printf("%s\n", hit.Fields["phrases"])
    }
}
The result comes as
[no nope not sure does not suits me not interested]
[yes yeah sure suits me interested]
The expected result is only
[no nope not sure does not suits me not interested]

The reason that it matches both is that the MatchPhraseQuery you are using will analyze the search terms. You didn't show the IndexMapping here so I can't be sure, but I'll assume you're using the "standard" analyzer. This analyzer removes English stop words, and the English stop word list is defined here:
https://github.com/blevesearch/bleve/blob/master/analysis/lang/en/stop_words_en.go#L281
So, this means that when you do a MatchPhraseQuery for "not interested" you end up just searching for "interested". And that term happens to also be in your "yes" list of synonyms.
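You can check what the analyzer does to the phrase by running it by hand. A minimal sketch, assuming bleve v1 (the "standard" analyzer is registered simply by importing the top-level bleve package):

package main

import (
    "fmt"

    "github.com/blevesearch/bleve"
)

func main() {
    indexMapping := bleve.NewIndexMapping()
    // Look up the built-in standard analyzer and run it over the phrase.
    analyzer := indexMapping.AnalyzerNamed("standard")
    tokens := analyzer.Analyze([]byte("not interested"))
    for _, token := range tokens {
        // Prints only "interested": "not" is dropped as a stop word.
        fmt.Println(string(token.Term))
    }
}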
It is worth noting that there is a variant called PhraseQuery (without Match) that does exact matching. While that wouldn't remove the word "not" at search time, it still wouldn't find the match, because the word "not" has been removed at index time as well: an exact match of "not interested" would find nothing (neither yes nor no).
The solution is to configure a custom analyzer which either doesn't remove any stop words, or uses a custom stop word list that doesn't contain the word "not". If you do this, and use it for both indexing and searching, the query you're using should start to work correctly.
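Here is a minimal sketch of such a mapping, assuming bleve v1; the analyzer name "en_keep_not", the index path, and the exact tokenizer/filter choices are illustrative, not taken from your setup:

package main

import (
    "log"

    "github.com/blevesearch/bleve"
)

func main() {
    indexMapping := bleve.NewIndexMapping()

    // Custom analyzer: unicode tokenizer plus lowercasing, but no stop-word
    // filter, so terms like "not" survive both indexing and querying.
    err := indexMapping.AddCustomAnalyzer("en_keep_not", map[string]interface{}{
        "type":          "custom",
        "tokenizer":     "unicode",
        "token_filters": []string{"to_lower"},
    })
    if err != nil {
        log.Fatal(err)
    }
    indexMapping.DefaultAnalyzer = "en_keep_not"

    // An index built with this mapping analyzes "not interested" into
    // ["not", "interested"], so the MatchPhraseQuery above should match
    // only the en-gen-no document.
    index, err := bleve.New("paraphrase.bleve", indexMapping)
    if err != nil {
        log.Fatal(err)
    }
    defer index.Close()
}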

Related

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
  "generator": {
    "name": "Xfer Records Serum",
    ....
  },
  "generator": {
    "name": "Lennar Digital Sylenth1",
    ....
  }
}
I ask the user for a search term, and the input is searched for in the name key only; all matching results are returned. That means if I input just 's', both of the above would be returned. Also, please explain how to return the names of all the objects which are generators. The simpler the method, the better for me. I use the json library, but if another library is required, that's not a problem.
Before switching to JSON I tried XML, but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is that it builds a regular expression matching any line that starts with "name": " (with a varying amount of whitespace), followed by your search term with any other characters around it, terminated with " followed by an optional , and the end of the string. It then applies that expression to each line from the file, filters out the non-matching lines, and returns the value of the name property (the capture-group contents) for each match.

Phrase & wildcard queries on Elasticsearch

I am facing some difficulties while trying to create a query that can match only whole phrases, but allows wildcards as well.
Basically I have a field that contains a string (it is actually a list of strings, but for simplicity I am skipping that), which can contain whitespace or be null; let's call it "color".
For example:
{
  ...
  "color": "Dull carmine pink"
  ...
}
My queries need to be able to do the following:
search for null values (inclusive and exclusive)
search for non null values (inclusive and exclusive)
search for and match only a whole phrase (inclusive and exclusive). For example:
dull carmine pink --> match
carmine pink --> not a match
same as the last, but with wildcards (inclusive and exclusive). For example:
?ull carmine p* --> match to "Dull carmine pink"
dull carmine* -> match to "Dull carmine pink"
etc.
I have been bumping my head against the wall for a few days with this and I have tried almost every type of query I could think of.
I have only managed to make it work partially with a span_near query with the help of this topic.
So basically I can now:
search for a whole phrase with/without wildcards like this:
{
  "span_near": {
    "clauses": [
      {
        "span_term": { "color": "dull" }
      },
      {
        "span_term": { "color": "carmine" }
      },
      {
        "span_multi": { "match": { "wildcard": { "color": "p*" } } }
      }
    ],
    "slop": 0,
    "in_order": true
  }
}
search for null values (inclusive and exclusive) by simple must/must_not queries like this:
{
  "must" / "must_not": { "exists": { "field": "color" } }
}
The problem:
I cannot find a way to make an exclusive span query. The only way I can find is this. But it requires both include & exclude fields, and I am only trying to exclude some fields, all others must be returned. Is there some analog of the "match_all":{} query that can work inside of an span_not's include field? Or perhaps an entire new, more elegant solution?
I found the solution a month ago, but I forgot to post it here.
I do not have an example at hand, but I will try to explain it.
The problem was that the fields I was trying to query were analyzed by Elasticsearch before querying; the analyzer in question was dividing them by spaces, etc. The solution to this problem is one of the following two:
1. If you do not use a custom mapping for the index
(Meaning you let Elasticsearch dynamically create the appropriate mapping for the field when you first added it.)
In this case Elasticsearch automatically creates a subfield of the text field called "keyword". This subfield is of the "keyword" type, which does not process the data in any way prior to querying.
Which means that queries like:
{
  "query": {
    "bool": {
      "must": [ // must_not
        {
          "match": {
            "user.keyword": "Kim Chy"
          }
        }
      ]
    }
  }
}
and
{
  "query": {
    "bool": {
      "must": [ // must_not
        {
          "wildcard": {
            "user.keyword": "Kim*y"
          }
        }
      ]
    }
  }
}
should work as expected.
However, with the default mapping the keyword subfield will most likely be case-sensitive. In order for it to be case-insensitive as well, you will need to create a custom mapping that applies a lowercase (or uppercase) normalizer to the keyword field, so that both the indexed values and the query terms are normalized prior to matching.
2. If you use a custom mapping
Basically the same as above, except that you will have to create the keyword subfield (or field) manually, possibly with a normalizer attached so that it is case-insensitive; see the sketch below.
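As an illustration, a mapping along these lines should do it; the index name is a placeholder, the "color" field is borrowed from the example above, and a 7.x-style single-type mapping is assumed:

PUT /my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "color": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}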
P.S. As far as I am aware, changing the mapping of an existing field is no longer possible in Elasticsearch. This means that you will have to create a new index with the appropriate mapping and then reindex your data into the new index.
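The built-in Reindex API can do the copy; a minimal sketch, with placeholder index names:

POST _reindex
{
  "source": { "index": "old-index" },
  "dest": { "index": "new-index" }
}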

JMeter JSON Extractor with multiple conditionals - failed

I am trying to create a JSON Extractor, and it is proving to be a tough task. I have this JSON structure:
[
  {
    "reportType": {
      "id": 3,
      "nomeTipoRelatorio": "etc etc etc",
      "descricaoTipoRelatorio": "etc etc etc",
      "esExibeSite": "S",
      "esExibeEmail": "S",
      "esExibeFisico": "N"
    },
    "account": {
      "id": 9999999,
      "holdersName": "etc etc etc",
      "accountNamber": "9999999",
      "nickname": null
    },
    "file": {
      "id": 2913847,
      "typeId": null,
      "version": null,
      "name": null,
      "format": null,
      "description": "description",
      "typeCode": null,
      "size": 153196,
      "mimeType": null,
      "file": null,
      "publicationDate": "2018-12-05",
      "referenceStartDate": "2018-12-05",
      "referenceEndDate": "2018-12-06",
      "extension": null,
      "fileStatusLog": {
        "idArquivo": 2913847,
        "dhAlteracao": "2018-12-05",
        "nmSistema": "SISTEMA X",
        "idUsuario": 999999,
        "reportStatusIndicador": "Z"
      }
    }
  }
]
What I need to do: first of all, I am using the option "Compute concatenation var" with "Match No." set to -1, because the service can return many of these in the response.
I have to verify whether "reportStatusIndicador" is 'Z' or 'Y'; if it is, I have to collect file.id or file.fileStatusLog.idArquivo (they are the same; I was trying the first option), in this case the number 2913847. If more results come, I will collect all the file.ids.
With these values in hand, I will continue with a ForEach over all the file.ids.
My last try, after reading a lot and trying many other combinations, was this:
[?(@...file.fileStatusLog.reportStatusIndicador == 'Z' || @...file.fileStatusLog.reportStatusIndicador == 'Y')].file.id
But my debug post processor always comes up empty:
filesIds=
Go for $..[?(@.file.fileStatusLog.reportStatusIndicador == 'Z' || @.file.fileStatusLog.reportStatusIndicador == 'Y')].file.id
References:
Jayway JsonPath: Inline Predicates
JMeter's JSON Path Extractor Plugin - Advanced Usage Scenarios
I could do it with this pattern:
$..[?(@.file.fileStatusLog.reportStatusIndicador == 'Z' || @.file.fileStatusLog.reportStatusIndicador == 'Y')].file.id
filesIds_ALL=2913755,2913756,2913758,2913759,2913760,2913761,2913762,2913763,2913764,2913765,2913766,2913767,2913768,2913769,2913770

Ruby search for match in large json

I have a pretty huge json file of short lines from a screenplay. I am trying to match keywords to keywords in the json file so I can pull out a line from the json.
The json file structure is like this:
[
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
"Ok, yeah, you guys got to put a negative spin on everything. ",
"No no I'm not ready, things are starting to happen. ",
"Ok, it's forgotten. ",
"Yeah, ok. ",
"Hey hey, whoa come on give me a hug... "
]
(plus lots more...2444 lines in total)
So far I have this, but it's not making any matches.
# screenplay is read in from a json file
@screenplay_lines = JSON.parse(@jsonfile.read)
@text_to_find = ["relationship", "negative", "hug"]
@matching_results = []
@screenplay_lines.each do |line|
  if line.match(Regexp.union(@text_to_find))
    @matching_results << line
  end
end
puts "found #{@matching_results.length} matches..."
puts @matching_results
I'm not getting any hits so not sure what's not working. Plus I'm sure it's a pretty expensive process doing it this way with a large amount of data. Any ideas? Thanks.
Yes, Regexp matching is slower than just checking if the String is included in a line of text. But this also depends on the number of keywords and the length of the lines and much more. So best would be to run at least a micro-benchmark.
lines = [
  "Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
  "Ok, yeah, you guys got to put a negative spin on everything. ",
  "No no I'm not ready, things are starting to happen. ",
  "Ok, it's forgotten. ",
  "Yeah, ok. ",
  "Hey hey, whoa come on give me a hug... "
]
keywords = ["relationship", "negative", "hug"]

def find1(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match(line) }
end

def find2(lines, keywords)
  lines.select { |line| keywords.any? { |keyword| line.include?(keyword) } }
end

def find3(lines, keywords)
  regexp = Regexp.union(keywords)
  lines.select { |line| regexp.match?(line) }
end

require 'benchmark/ips'

Benchmark.ips do |x|
  x.compare!
  x.report('match') { find1(lines, keywords) }
  x.report('include?') { find2(lines, keywords) }
  x.report('match?') { find3(lines, keywords) }
end
In this setup the include? variant is way faster:
Comparison:
include?: 288083.4 i/s
match?: 91505.7 i/s - 3.15x slower
match: 65866.7 i/s - 4.37x slower
Please note:
I've moved creation of the regexp out of the loop; it does not need to be created for every line. Creating a regexp is an expensive operation (your variant clocked in at 1/5 the speed of the version with the regexp outside the loop).
match? is only available in Ruby 2.4+; it is faster because it does not assign any match results (it is side-effect free).
I would not worry too much about performance for 2500 lines of text. If it is fast enough, then stop searching for a better solution.
There is a possible alternative: try the json_expressions gem.

How do I do a partial match in Elasticsearch?

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
  bool: {
    must: {
      match: { text: 'google' }
    }
  }
}
But this only matches if the whole text is 'google' (case insensitive, so it also matches Google or GooGlE etc). How do I match for the 'google' inside of another string?
The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any character (except a newline), you can use the .* pattern:
match: { text: '.*google.*'}
In ES6+, use regexp instead of match:
"query": {
  "regexp": { "text": ".*google.*" }
}
One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in ElasticSearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
However, if you do not plan to match any complicated patterns and need no word-boundary checking, a search for a mere substring is better performed with a wildcard search:
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*google*",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind the only pair of special characters in wildcard patterns:
?, which matches any single character
*, which can match zero or more characters, including an empty one
Use a wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'
For both partial and full-text matching, the following worked:
"query": {
  "query_string": {
    "query": "*searchText*",
    "fields": [
      "fieldName"
    ]
  }
}
I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
"regexp": { "text": ".*google.*"}
}
For partial matching you can either use prefix or match_phrase_prefix.
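For illustration, a match_phrase_prefix query against the text field from the question might look like this (the prefix value is just an example); it should match documents whose text contains a token starting with drive.goo, such as drive.google.com:
{
  "query": {
    "match_phrase_prefix": {
      "text": "drive.goo"
    }
  }
}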
For a more generic solution you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer, which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working: it is being compared against the full "drive.google.com".
If you instead indexed your documents using the simple analyzer, it would be split into "http", "drive", "google", and "com", which lets you match any one of those terms on its own.
Using the Node.js client, where tag_name is the field name and value is the incoming search value:
const { body } = await elasticWrapper.client.search({
  index: ElasticIndexs.Tags,
  body: {
    query: {
      wildcard: {
        tag_name: {
          value: `*${value}*`,
          boost: 1.0,
          rewrite: 'constant_score',
        },
      },
    },
  },
});
You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
  query: `*${keyword}*`,
  fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.