What's the difference between the 'originalText' and 'word' keys in a token? - nltk

When using CoreNLPParser from NLTK with CoreNLP Server, the resulting tokens contain both an 'originalText' key and a 'word' key.
What's the difference between the two? Is there any documentation about them?
I've only found this issue, which mentions the originalText key, but it doesn't answer my questions.
from nltk.parse.corenlp import CoreNLPParser
corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
text = u'我家没有电脑。'
result = corenlp_parser.api_call(text, {'annotators': 'tokenize,ssplit'})
print(result)
prints
{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "我家",
          "originalText": "我家",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 2
        },
        {
          "index": 2,
          "word": "没有",
          "originalText": "没有",
          "characterOffsetBegin": 2,
          "characterOffsetEnd": 4
        },
        {
          "index": 3,
          "word": "电脑",
          "originalText": "电脑",
          "characterOffsetBegin": 4,
          "characterOffsetEnd": 6
        },
        {
          "index": 4,
          "word": "。",
          "originalText": "。",
          "characterOffsetBegin": 6,
          "characterOffsetEnd": 7
        }
      ]
    }
  ]
}
Update:
It seems the Token class implements HasWord and HasOriginalText.

The word value is transformed a little to make it possible, e.g., to print it in an S-expression (i.e., a parse tree). So parentheses and other brackets become tokens like -LRB- (left round bracket), quotes are normalized to backticks (``) and forward ticks (''), and a few other small changes are made.
originalText, by contrast, is the literal original text of the token that can be used to reconstruct the original sentence.
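To see the difference in practice, you can tokenize text containing parentheses. A minimal sketch, assuming the same server setup as in the question but running the English pipeline with default PTB-style tokenizer options:

from nltk.parse.corenlp import CoreNLPParser

# Assumes a CoreNLP server with English models at localhost:9000.
corenlp_parser = CoreNLPParser('http://localhost:9000', encoding='utf8')
result = corenlp_parser.api_call('(a)', {'annotators': 'tokenize,ssplit'})
for token in result['sentences'][0]['tokens']:
    print(token['word'], token['originalText'])
# With default PTB escaping this should print:
#   -LRB- (
#   a a
#   -RRB- )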

Related

Regex Remove Spaces in JSON String

I have a wider automation that populates a lookup table and then serializes the data into a JSON file, which is my desired output.
Once the data is in JSON form, I need to remove the spaces from the lookup column headers (the keys).
Is it possible to write a regex that identifies the headers and removes those spaces?
JSON string below:
[
  {
    "INVOLVED PARTY ID": " 9445999606",
    "CUSTOMER NUMBER": " 9445999606",
    "PRODUCT": "Current Account",
    "LAST UPDATED": "20/02/2020 10:33:00",
    "APPLICATION STATUS": "Clearing Handbrake",
    "PROGRESS": "Progress",
    "APPLICANT": " ACCEPT FLEX INDICATOR Y",
    "QUESTION 3 - HEART/CANCER CONDITIONS": null
  }
]
Desired output after regex manipulation
[
  {
    "INVOLVEDPARTYID": " 9445999606",
    "CUSTOMERNUMBER": " 9445999606",
    "PRODUCT": "Current Account",
    "LASTUPDATED": "20/02/2020 10:33:00",
    "APPLICATIONSTATUS": "Clearing Handbrake",
    "PROGRESS": "Progress",
    "APPLICANT": " ACCEPT FLEX INDICATOR Y",
    "QUESTION3-HEART/CANCERCONDITIONS": null
  }
]
Notice only the spaces within the headers have been removed.
Any help with the regex string would be much appreciated; alternatively, please point me in the right direction.
Well, this one works fine:
(?<=\"[A-Z0-9 /-]*) (?=[A-Z0-9 /-]*\":)
It has two lookaround assertions:
The lookbehind matches capital letters, digits, spaces, slashes, and hyphens preceded by a double quotation mark.
The lookahead matches the same character set followed by a double quotation mark and a colon.
In between is the space, which is what actually gets matched and removed.
Check this out https://regexr.com/4vogd
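A quick way to try the same idea in Python: the lookbehind is variable-length, which the standard re module rejects, so this sketch uses the third-party regex module (pip install regex); the sample string is illustrative:

import regex  # third-party module; supports variable-length lookbehind

raw = '{"INVOLVED PARTY ID":" 9445999606","LAST UPDATED":"20/02/2020 10:33:00"}'
# Remove any space that sits between a quote and a quote-colon, i.e. inside a key.
cleaned = regex.sub(r'(?<="[A-Z0-9 /-]*) (?=[A-Z0-9 /-]*":)', '', raw)
print(cleaned)
# {"INVOLVEDPARTYID":" 9445999606","LASTUPDATED":"20/02/2020 10:33:00"}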
The logic here is to create a new empty result object, iterate over the previous object's keys, remove the whitespace from each key, and assign the previous value (intact) to that filtered key on the result object:
const yourData = [
  {
    "INVOLVED PARTY ID": " 9445999606",
    "CUSTOMER NUMBER": " 9445999606",
    "PRODUCT": "Current Account",
    "LAST UPDATED": "20/02/2020 10:33:00",
    "APPLICATION STATUS": "Clearing Handbrake",
    "PROGRESS": "Progress",
    "APPLICANT": " ACCEPT FLEX INDICATOR Y",
    "QUESTION 3 - HEART/CANCER CONDITIONS": null
  }
];

let newData = yourData.map(obj => {
  let regexedObj = {};
  Object.keys(obj).forEach(prevKey => {
    // pattern can be /\s/g too, depending on the use-case
    const regexedKey = prevKey.replace(/ /g, '');
    regexedObj[regexedKey] = obj[prevKey];
  });
  return regexedObj;
});
console.log(newData);
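Since the input is machine-generated JSON, another robust option is to skip regex on the raw string entirely: parse it, rebuild the keys, and re-serialize. A Python sketch of the same transformation (variable names are illustrative):

import json

raw = '[{"INVOLVED PARTY ID": " 9445999606", "PRODUCT": "Current Account"}]'
data = json.loads(raw)
# Rebuild each object with the spaces stripped from its keys; values stay intact.
cleaned = [{key.replace(' ', ''): value for key, value in obj.items()} for obj in data]
print(json.dumps(cleaned))
# [{"INVOLVEDPARTYID": " 9445999606", "PRODUCT": "Current Account"}]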

How to validate Sub-Sets of JSON Keys using match contains when there are nested JSON's in the response

From a response, I extracted a subset like this.
{
  "base": {
    "first": {
      "code": "1",
      "description": "Its First"
    },
    "second": {
      "code": "2",
      "description": "Its Second"
    },
    "default": {
      "last": {
        "code": "last",
        "description": "No"
      }
    }
  }
}
How can I do a single validation using And match X contains to check that
inside first the code is 1, and
inside default.last the code is last?
Instead of using a JSON path for every validation, I am trying to extract a specific portion and validate it. When there are no nested JSON paths, I can do this very easily using And match X contains; however, when there are nested JSONs, I am not able to.
Does this work for you:
* def first = get[0] response..first
* match first.code == '1'
* def last = get[0] response..default.last
* match last.code == 'last'
Edit: OK, it looks like you want to condense this into one line as far as possible, and more importantly to be able to do contains on nested nodes. Personally, I sometimes find this not to be worth the trouble, but here goes.
Refer also to these short-cuts: https://github.com/intuit/karate#contains-short-cuts
* def first = { code: "1" }
* match response.base.first contains first
* match response.base contains { first: '#(^first)' }
* def last = { code: 'last' }
* match response.base contains { first: '#(^first)', default: { last: '#(^last)' } }
Hmm, my question is slightly different, I think.
For example, if I directly point to first using a JSON path and save it to a variable savedResponse, I can do this validation:
And match savedResponse contains { code: "1" }
If there were 10 key-value combinations under first and I needed to validate 6 of them, I could use the same JSON path and easily do it with match contains.
Similarly, if I save the above response to a variable savedResponse, how can I validate multiple things using match contains? The statement below will not work:
And match savedResponse contains { first: { code: "1" }, last: { code: "last" } }
However, if I modify something, will it work?

JSONPath Syntax when dot in key

Please forgive me if I use the incorrect terminology; I am quite the novice.
I have some simple JSON:
{
  "properties": {
    "footer.navigationLinks": {
      "group": "layout",
      "default": [
        {
          "text": "Link a",
          "href": "#"
        }
      ]
    }
  }
}
I am trying to pinpoint "footer.navigationLinks" but I am having trouble with the dot in the key name. I am using http://jsonpath.com/ and when I enter
$.properties['footer.navigationLinks']
I get 'No match'. If I change the key to "footernavigationLinks" it works, but I cannot control the key names in the JSON file.
Please can someone help me target that key name?
Having a json response:
{
  "0": {
    "SKU": "somevalue",
    "Merchant.Id": 234
  }
}
I can target a key with a . (dot) in the name:
jsonPath.getJsonObject("0.\"Merchant.Id\"")
Note the quotes and the fact that they are escaped.
I'm not sure about other versions, but I'm using:
'com.jayway.restassured', name: 'json-path', version: '2.9.0'
A few samples/solutions I've seen used single quotes with brackets, but that did not work for me.
For information, jsonpath.com has been patched since the question was asked, and it now works for the example given in the question. I tried these paths successfully:
$.properties['footer.navigationLinks']
$.properties.[footer.navigationLinks]
$.properties.['footer.navigationLinks']
$['properties']['footer.navigationLinks']
$.['properties'].['footer.navigationLinks']
properties.['footer.navigationLinks']
etc.
This issue was reported in 2007 as issue #4 ("Member names containing dot fail") and was fixed.
The fix is not present in this online jsonpath.com implementation, but it is fixed in this old archive and probably in most of the forks that have been created since (like here and here).
Details about the bug
A comparison between the buggy version and the 2007-corrected version of the code reveals that the correction was made in the private normalize function.
In the 2007-corrected version it reads:
normalize: function(expr) {
    var subx = [];
    return expr.replace(/[\['](\??\(.*?\))[\]']|\['(.*?)'\]/g, function($0,$1,$2){
        return "[#"+(subx.push($1||$2)-1)+"]";
    }) /* http://code.google.com/p/jsonpath/issues/detail?id=4 */
    .replace(/'?\.'?|\['?/g, ";")
    .replace(/;;;|;;/g, ";..;")
    .replace(/;$|'?\]|'$/g, "")
    .replace(/#([0-9]+)/g, function($0,$1){
        return subx[$1];
    });
},
The first and last replace in that sequence make sure the second replace does not interpret a dot in a property name as a property separator.
I had a look at the more up-to-date forks that have been made since then, and the code has evolved enormously.
Conclusion:
jsonpath.com is based on an outdated version of JSONPath and is not reliable for previewing what current libraries would provide you with.
You can wrap the 'key with dots' in single quotes, as below:
response.jsonpath().get("properties.'footer.navigationLinks'")
Or even escape the single quotes as shown:
response.jsonpath().get("properties.\'footer.navigationLinks\'")
Both work fine
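For reference, the same key can be targeted from Python. This sketch assumes the jsonpath-ng library (pip install jsonpath-ng), whose grammar also accepts quoted member names:

from jsonpath_ng import parse

data = {"properties": {"footer.navigationLinks": {"group": "layout"}}}
# Quoting the member name keeps the dot from being read as a path separator.
matches = parse("$.properties.'footer.navigationLinks'").find(data)
print(matches[0].value)  # {'group': 'layout'}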

How do I do a partial match in Elasticsearch?

I have a link like http://drive.google.com and I want to match "google" out of the link.
I have:
query: {
  bool: {
    must: {
      match: { text: 'google' }
    }
  }
}
But this only matches if the whole text is 'google' (case insensitive, so it also matches Google or GooGlE etc). How do I match for the 'google' inside of another string?
The point is that the ElasticSearch regex you are using requires a full string match:
Lucene’s patterns are always anchored. The pattern provided must match the entire string.
Thus, to match any sequence of characters (other than newlines), you can use the .* pattern:
match: { text: '.*google.*'}
In Elasticsearch 6+, use regexp instead of match:
"query": {
  "regexp": { "text": ".*google.*" }
}
One more variation is for cases when your string can have newlines: match: { text: '(.|\n)*google(.|\n)*'}. This awful (.|\n)* is a must in ElasticSearch because this regex flavor does not allow any [\s\S] workarounds, nor any DOTALL/Singleline flags. "The Lucene regular expression engine is not Perl-compatible but supports a smaller range of operators."
However, if you do not plan to match any complicated patterns and need no word boundary checking, regex search for a mere substring is better performed with a mere wildcard search:
{
  "query": {
    "wildcard": {
      "text": {
        "value": "*google*",
        "boost": 1.0,
        "rewrite": "constant_score"
      }
    }
  }
}
See Wildcard search for more details.
NOTE: The wildcard pattern also needs to match the whole input string, thus
google* finds all strings starting with google
*google* finds all strings containing google
*google finds all strings ending with google
Also, bear in mind that wildcard patterns contain only two special characters:
?, which matches any single character
*, which can match zero or more characters, including an empty one
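The same wildcard query can be sent from Python; a sketch assuming the official elasticsearch client (8.x style, where the query is passed as a keyword argument) and a placeholder index name links:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
resp = es.search(index='links', query={
    'wildcard': {'text': {'value': '*google*'}}
})
for hit in resp['hits']['hits']:
    print(hit['_source'])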
Use a wildcard query:
'{"query":{ "wildcard": { "text.keyword" : "*google*" }}}'
For both partial and full text matching, the following worked:
"query": {
  "query_string": {
    "query": "*searchText*",
    "fields": [
      "fieldName"
    ]
  }
}
I can't find a breaking change disabling regular expressions in match, but match: { text: '.*google.*'} does not work on any of my Elasticsearch 6.2 clusters. Perhaps it is configurable?
Regexp works:
"query": {
"regexp": { "text": ".*google.*"}
}
For partial matching you can either use prefix or match_phrase_prefix.
For a more generic solution, you can look into using a different analyzer or defining your own. I am assuming you are using the standard analyzer, which would split http://drive.google.com into the tokens "http" and "drive.google.com". This is why the search for just google isn't working: it is being compared to the full "drive.google.com".
If you instead indexed your documents using the simple analyzer, it would split the URL into "http", "drive", "google", and "com", allowing you to match any one of those terms on its own.
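A sketch of how such a mapping could be created with the official Python client (the index name links is a placeholder); the simple analyzer tokenizes on any non-letter, which gives the token split described above:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# Index the 'text' field with the built-in simple analyzer instead of standard.
es.indices.create(
    index='links',
    mappings={'properties': {'text': {'type': 'text', 'analyzer': 'simple'}}},
)
# After reindexing, match: { text: 'google' } will find http://drive.google.com.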
Using the Node.js client, where tag_name is the field name and value is the incoming search value:
const { body } = await elasticWrapper.client.search({
  index: ElasticIndexs.Tags,
  body: {
    query: {
      wildcard: {
        tag_name: {
          value: `*${value}*`,
          boost: 1.0,
          rewrite: 'constant_score',
        },
      },
    },
  },
});
You're looking for a wildcard search. According to the official documentation, it can be done as follows:
query_string: {
  query: `*${keyword}*`,
  fields: ["fieldOne", "fieldTwo"],
},
Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters: qu?ck bro*
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-wildcard
Be careful, though:
Be aware that wildcard queries can use an enormous amount of memory and perform very badly — just think how many terms need to be queried to match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false.

JSON in MongoDb: {"1" : "some content"} VS { _1 : "some content"} Any difference?

I am using a dynamic key that is a series of numbers. Because you have to double-quote numbers used as keys in JSON, is there any difference between using a dynamic key that is a number converted to a string versus a number concatenated with an underscore (which also turns it into a string)?
It DOESN'T seem like there is a difference; however, I'd like to ask everyone because sometimes an unexpected difference does turn up.
{ "1" : "some content",
"2" : "some more content"
}
versus
{ _1 : "some content",
_2 : "some more content"
}
Thanks.
Note that it is not JSON that you are sending to MongoDB but JavaScript objects (the difference is explained here). For this reason, the following expressions are equivalent:
// unquoted key
db.col.insert({ 1: "key 1 unquoted"})
// quoted key
db.col.insert({ "1": "key 1 quoted"})
So, back to your question: the only difference is that in one case you have 1 as the key, and in the other _1.
Of course, it also depends on which driver you use to write the data to Mongo; the driver could be responsible for any difference you see between quoted and unquoted keys. Testing in the mongo shell, you get the same results.
The above is all true for top-level keys. But if you have keys like 1, 2, 3 at other levels, things can become tricky, and for this reason I recommend not using numbers as keys. The problems come from the Mongo query syntax for handling arrays.
Assume the following document in a collection:
{
  "foo": {
    "0": "abc"
  },
  "bar": ["x", "y", "z"]
}
Both queries below are valid:
db.col.find({ "foo.0": "abc" })
db.col.find({ "bar.0": "x" })
Just the semantics are different:
in the former query you query for the documents containing a foo key which is an object having a key 0 with value abc
in the latter query you ask for the documents containing a bar key which is an array having x on the first position (0)
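To make the distinction concrete, a pymongo sketch (assuming a local mongod; the database and collection names are placeholders):

from pymongo import MongoClient

col = MongoClient()['test']['col']
col.insert_one({'foo': {'0': 'abc'}, 'bar': ['x', 'y', 'z']})

print(col.find_one({'foo.0': 'abc'}))  # here "0" is an object key
print(col.find_one({'bar.0': 'x'}))    # here 0 is an array index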