Prolog rule to infer new facts from data

I have a dataset containing "facts" recognizable to Prolog, e.g.:
'be'('mr jiang', 'representative of china').
'support'('the establishment of the sar', 'mr jiang').
'be more than'('# distinguished guests', 'the principal representatives').
'end with'('the playing of the british national anthem', 'hong kong').
'follow at'('the stroke of midnight', 'this').
'take part in'('the ceremony', 'both countries').
'start at about'('# pm', 'the ceremony').
'end about'('# am', 'the ceremony').
I want the system to recognize that 'mr jiang' is referenced in both of the following "facts":
'be'('mr jiang', 'representative of china').
'support'('the establishment of the sar', 'mr jiang').
The system should then infer:
'support'('the establishment of the sar', 'representative of china').
I've spent some time looking at the FOIL algorithm. Do you think that would do the trick, or is that overkill?
Would something like this work:
'X'('Y', 'Z') :-
    'A'('Y', 'B'),
    '0'('B', 'Z').
Is it possible to write a general "rule" like that, or does it have to be more specific?
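For intuition, the inference being asked about is essentially a join over the facts. A minimal Python sketch of that join, purely illustrative (not Prolog), assuming facts are stored as (predicate, arg1, arg2) tuples and that 'be' supplies the substitution:

facts = [
    ('be', 'mr jiang', 'representative of china'),
    ('support', 'the establishment of the sar', 'mr jiang'),
]

# For every 'be'(Alias, Canonical) fact, rewrite any other fact
# mentioning Alias so that it mentions Canonical instead.
inferred = []
for pred, alias, canonical in facts:
    if pred != 'be':
        continue
    for p, a, b in facts:
        if p == 'be':
            continue
        if alias in (a, b):
            inferred.append((p,
                             canonical if a == alias else a,
                             canonical if b == alias else b))

print(inferred)
# [('support', 'the establishment of the sar', 'representative of china')]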

How to retrieve data from JSON

I retrieved a dataset from a news API in JSON format. I want to extract the news description from the JSON data.
This is my code:
import requests
import json

url = ('http://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=608bf565c67f4d99994c08d74db82f54')
response = requests.get(url)
di = response.json()
di = json.dumps(di)
for di['articles'] in di:
    print(article['title'])
The dataset looks like this:
{'status': 'ok',
'totalResults': 38,
'articles': [
{'source':
{'id': 'the-washington-post',
'name': 'The Washington Post'},
'author': 'Derek Hawkins, Marisa Iati',
'title': 'Coronavirus updates: Texas, Florida and Arizona officials say early reopenings fueled an explosion of cases - The Washington Post',
'description': 'Local officials in states with surging coronavirus cases issued dire warnings Sunday about the spread of infections, saying the virus was rapidly outpacing containment efforts.',
'url': 'https://www.washingtonpost.com/nation/2020/07/05/coronavirus-update-us/',
'urlToImage': 'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/K3UMAKF6OMI6VF6BNTYRN77CNQ.jpg&w=1440',
'publishedAt': '2020-07-05T18:32:44Z',
'content': 'Here are some significant developments:\r\n<ul><li>The rolling seven-day average for daily new cases in the United States reached a record high for the 27th day in a row, climbing to 48,606 on Sunday, … [+5333 chars]'}]}
Please guide me with this!
There are a few corrections needed in your code. The code below should work. I have removed the API key from the answer, so make sure you add your own before testing:
import requests
import json

url = ('http://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=<API KEY>')
response = requests.get(url)
di = response.json()
# You don't need to dump JSON that is already parsed into a dict
# di = json.dumps(di)
# Your loop was not defined correctly; below is the correct way to do it
for article in di['articles']:
    print(article['title'])
response.json() returns the dictionary already shown in the question.
Code:
di = response.json()  # 'di' is a dictionary, i.e. key-value pairs
for i in di["articles"]:
    print(i["description"])
"articles" is one of the keys of dictionary di, It's corresponding value is of type list. "description" , which you are looking is part of this list (value of "articles"). Further list contains the dictionary (key-value pair).You can access from key - description

Data preprocessing for Named Entity Recognition?

I'm working on Named Entity Recognition on a resume dataset, and we have entities like dates, phone numbers, emails, etc.
I'm working out how to preprocess those entities. I'm currently adding a space after each punctuation mark, like this:
DAVID B-Name
John I-Name
, O
IT O
Washington B-Address
, I-Address
DC I-Address
( B-Phone
107 I-Phone
) I-Phone
155 I-Phone
- I-Phone
4838 I-Phone
david B-Email
. I-Email
John I-Email
# I-Email
gmail I-Email
. I-Email
com I-Email
But I'm starting to question how to handle such text during inference. I'm assuming that even at inference we have to preprocess the text with the same process, that is, adding a space after each punctuation mark, right?
But then it won't be very readable, right?
For example, at inference I would have to provide input text like test # example . com, which is not readable. The model would only be able to predict entities in that format.
The problem you're trying to deal with is called tokenization. To handle the formatting issue you raise, frameworks often extract the tokens from the underlying text in a way that preserves the original text, such as keeping track of the character start and end offsets for each token.
For instance, spaCy in Python returns an object that stores all of this information:
import spacy
from pprint import pprint
nlp = spacy.load("en_core_web_sm")
doc = nlp("DAVID John, IT\nWashington, DC (107) 155-4838 david.John#gmail.com")
pprint([(token.text, token.idx, token.idx + len(token.text)) for token in doc])
output:
[('DAVID', 0, 5),
('John', 6, 10),
(',', 10, 11),
('IT', 12, 14),
('\n', 14, 15),
('Washington', 15, 25),
(',', 25, 26),
('DC', 27, 29),
('(', 30, 31),
('107', 31, 34),
(')', 34, 35),
('155', 36, 39),
('-', 39, 40),
('4838', 40, 44),
('david.John#gmail.com', 45, 65)]
You could either do the same sort of thing yourself (e.g. keep a counter as you add spaces) or use an existing tokenizer (such as spaCy, CoreNLP, TensorFlow, etc.).
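To make the "keep a counter" idea concrete, here is a minimal sketch (illustrative, not tied to any particular framework) that splits off punctuation while recording each token's offsets into the original, readable string:

import re

def tokenize_with_offsets(text):
    # Match runs of word characters, or single punctuation marks,
    # and keep each token's start/end offsets in the ORIGINAL text.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

print(tokenize_with_offsets("david.John#gmail.com"))
# [('david', 0, 5), ('.', 5, 6), ('John', 6, 10), ('#', 10, 11),
#  ('gmail', 11, 16), ('.', 16, 17), ('com', 17, 20)]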

Business Objects error while trying to use the "OR" function

The formula is found below:
=NoFilter(Count([Same Day];All) Where ([Person Location- Facility (Curr)]="FH ORL") Where ([Order Catalog Short Description]="Physical Therapy For Whirlpool Wound Care Evaluation And Treatment") Or ([Person Location- Nurse Unit (Curr)] InList ("7TWR";"RIO1";"GT12";"GT14";"9TWR";"XTWR";"RIO")))
Error Message: The expression or sub-expression at position 10 in the 'Or' function uses an invalid data type
The structure of your formula is:
Nofilter (xxx) Where (yyy) Or (zzz InList(aaa))
It's complaining because it sees yyy as the only parameter to Where(). The structure should look like:
Nofilter (xxx) Where (yyy Or zzz InList(aaa))
So try:
=NoFilter(Count([Same Day];All) Where ([Person Location- Facility (Curr)]="FH ORL") Where ([Order Catalog Short Description]="Physical Therapy For Whirlpool Wound Care Evaluation And Treatment" Or [Person Location- Nurse Unit (Curr)] InList ("7TWR";"RIO1";"GT12";"GT14";"9TWR";"XTWR";"RIO")))

What do the tags from the Stanford dependency parser (3.9.1) mean?

I used the Stanford dependency parser (3.9.1) to parse a sentence, and I got the following result:
[[(('investigating', 'VBG'), 'nmod', ('years', 'NNS')),
(('years', 'NNS'), 'case', ('In', 'IN')),
(('years', 'NNS'), 'det', ('the', 'DT')),
(('years', 'NNS'), 'amod', ('last', 'JJ')),
(('years', 'NNS'), 'nmod', ('century', 'NN')),
(('century', 'NN'), 'case', ('of', 'IN')),
(('century', 'NN'), 'det', ('the', 'DT')),
(('century', 'NN'), 'amod', ('nineteenth', 'JJ')),
(('investigating', 'VBG'), 'nsubj', ('Planck', 'NNP')),
(('investigating', 'VBG'), 'aux', ('was', 'VBD')),
(('investigating', 'VBG'), 'dobj', ('problem', 'NN')),
(('problem', 'NN'), 'det', ('the', 'DT')),
(('problem', 'NN'), 'nmod', ('radiation', 'NN')),
(('radiation', 'NN'), 'case', ('of', 'IN')),
(('radiation', 'NN'), 'amod', ('black-body', 'JJ')),
(('radiation', 'NN'), 'acl', ('posed', 'VBN')),
(('posed', 'VBN'), 'advmod', ('first', 'RB')),
(('posed', 'VBN'), 'nmod', ('Kirchhoff', 'NNP')),
(('Kirchhoff', 'NNP'), 'case', ('by', 'IN')),
(('Kirchhoff', 'NNP'), 'advmod', ('earlier', 'RBR')),
(('earlier', 'RBR'), 'nmod:npmod', ('years', 'NNS')),
(('years', 'NNS'), 'det', ('some', 'DT')),
(('years', 'NNS'), 'amod', ('forty', 'JJ'))]]
The meanings of some of the tags, such as 'nmod' and 'acl', are missing from the StanfordDependencyManual. The newest manual version I can find is 3.7.0. I also found some explanations in a Standard_list_of_dependency_relations,
but it still misses some tags.
Hence, my question is: where can I find the newest version of the explanation of these tags? Thanks!
For the last few versions, the Stanford parser has been generating Universal Dependencies rather than Stanford Dependencies. The new relation set can be found here; the relations for version 1 are listed below (version 2 still seems to be a work in progress):
acl: clausal modifier of noun
acl:relcl: relative clause modifier
advcl: adverbial clause modifier
advmod: adverbial modifier
amod: adjectival modifier
appos: appositional modifier
aux: auxiliary
auxpass: passive auxiliary
case: case marking
cc: coordination
cc:preconj: preconjunct
ccomp: clausal complement
compound: compound
compound:prt: phrasal verb particle
conj: conjunct
cop: copula
csubj: clausal subject
csubjpass: clausal passive subject
dep: dependent
det: determiner
det:predet: predeterminer
discourse: discourse element
dislocated: dislocated elements
dobj: direct object
expl: expletive
foreign: foreign words
goeswith: goes with
iobj: indirect object
list: list
mark: marker
mwe: multi-word expression
name: name
neg: negation modifier
nmod: nominal modifier
nmod:npmod: noun phrase as adverbial modifier
nmod:poss: possessive nominal modifier
nmod:tmod: temporal modifier
nsubj: nominal subject
nsubjpass: passive nominal subject
nummod: numeric modifier
parataxis: parataxis
punct: punctuation
remnant: remnant in ellipsis
reparandum: overridden disfluency
root: root
vocative: vocative
xcomp: open clausal complement
Although the old Stanford Dependencies format is no longer maintained, you can still get it by setting the property depparse.language to English (see, e.g., here):
properties.setProperty("depparse.language", "English")
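For reference, nested triples in the shape shown in the question are what NLTK's CoreNLP wrapper produces. A minimal sketch, assuming a CoreNLP server is already running locally on port 9000 (the sentence here is reconstructed from the question's output):

from nltk.parse.corenlp import CoreNLPDependencyParser

# Assumes a server started with something like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url='http://localhost:9000')
sentence = ('In the last years of the nineteenth century Planck was '
            'investigating the problem of black-body radiation first '
            'posed by Kirchhoff some forty years earlier')
# Each parse yields ((head, head_tag), relation, (dependent, dep_tag)) triples
parses = parser.parse(sentence.split())
print([list(parse.triples()) for parse in parses])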

How to define special "untokenizable" words for nltk.word_tokenize

I'm using nltk.word_tokenize to tokenize sentences that contain programming languages, frameworks, etc., and these get incorrectly tokenized.
For example:
>>> from nltk import tokenize
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to give the tokenizer a list of "exceptions" like this? I have already compiled a list of all the things (languages, etc.) that I don't want split.
The multi-word expression tokenizer (nltk.tokenize.MWETokenizer) should be what you need.
You add the exceptions as tuples and then pass it the already-tokenized sentences:
>>> import nltk
>>> tokenizer = nltk.tokenize.MWETokenizer()
>>> tokenizer.add_mwe(('C', '#'))
>>> tokenizer.add_mwe(('F', '#'))
>>> tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
>>> tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']
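If you'd rather get 'C#' back instead of 'C_#', the string used to join the pieces is configurable via the separator argument (a small sketch):

>>> tokenizer = nltk.tokenize.MWETokenizer(separator='')
>>> tokenizer.add_mwe(('C', '#'))
>>> tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C#', '.']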