CatBoostClassifier - AUC metric - catboost

I have a question about CatBoostClassifier.
params = {
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'verbose': 200,
    'random_seed': 42,
    'custom_metric': 'AUC:hints=skip_train~false'
}

cbc = CatBoostClassifier(**params)
cbc.fit(x_tr, y_tr,
        eval_set=(x_te, y_te),
        use_best_model=True,
        plot=True)
predictions = cbc.predict(x_te)
Model results:
bestTest = 0.6786987522
But when I try:
from sklearn import metrics
auc = metrics.roc_auc_score(y_te, predictions)
auc
I get 0.5631684491978609. Why do these results differ? What do the first and second results mean? Which one is the final metric of my cbc model?

OK, I found the solution. I should use:
predictions = cbc.predict_proba(x_te)
rather than
predictions = cbc.predict(x_te)
predict returns hard class labels, while roc_auc_score expects ranking scores such as probabilities (for binary targets, pass the positive-class column predict_proba(x_te)[:, 1]). Now I have the same results.
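The gap between the two numbers comes from what each call returns: predict gives hard 0/1 labels, while AUC measures ranking quality, so it needs scores. A minimal sketch with made-up labels and probabilities (not the original x_te/y_te) shows how thresholding can lower the measured AUC:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical data, for illustration only
y_true = [0, 1, 0, 1]
proba = [0.3, 0.45, 0.2, 0.9]             # what predict_proba(...)[:, 1] would give
labels = [int(p >= 0.5) for p in proba]   # what predict(...) gives: [0, 0, 0, 1]

print(roc_auc_score(y_true, proba))   # 1.0  - the probabilities rank all positives above all negatives
print(roc_auc_score(y_true, labels))  # 0.75 - thresholding collapses the ranking
```

The probabilities order every positive above every negative, but after thresholding at 0.5 one positive is indistinguishable from the negatives, so the label-based AUC drops.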

Understanding the output of a pyomo model, number of solutions is zero?

I am using this code to create and solve a simple problem:
import pyomo.environ as pyo
from pyomo.core.expr.numeric_expr import LinearExpression

model = pyo.ConcreteModel()
model.nVars = pyo.Param(initialize=4)
model.N = pyo.RangeSet(model.nVars)
model.x = pyo.Var(model.N, within=pyo.Binary)
model.coefs = [1, 1, 3, 4]
model.linexp = LinearExpression(constant=0,
                                linear_coefs=model.coefs,
                                linear_vars=[model.x[i] for i in model.N])

def caprule(m):
    return m.linexp <= 50

model.capme = pyo.Constraint(rule=caprule)
model.obj = pyo.Objective(expr=model.linexp, sense=maximize)

results = SolverFactory('glpk', executable='/usr/bin/glpsol').solve(model)
results.write()
And this is the output:
# ==========================================================
# = Solver Results =
# ==========================================================
# ----------------------------------------------------------
# Problem Information
# ----------------------------------------------------------
Problem:
- Name: unknown
  Lower bound: 50.0
  Upper bound: 50.0
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 5
  Number of nonzeros: 5
  Sense: maximize
# ----------------------------------------------------------
# Solver Information
# ----------------------------------------------------------
Solver:
- Status: ok
  Termination condition: optimal
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
  Error rc: 0
  Time: 0.09727835655212402
# ----------------------------------------------------------
# Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
  number of solutions displayed: 0
It says the number of solutions is 0, and yet it does solve the problem:
print(list(model.x[i]() for i in model.N))
Will output this:
[1.0, 1.0, 1.0, 1.0]
Which is a correct answer to the problem. What am I missing?
The interface between Pyomo and GLPK sometimes (always?) seems to return 0 for the number of solutions. I'm assuming there is some issue in the generalized interface between the Pyomo core module and the various solvers it talks to: when I run this with the glpk and cbc solvers, both report the number of solutions as zero, so perhaps those solvers simply don't fill in that field. Someone with more experience with the data returned from the solver may know precisely. That said, the main thing to look at is the termination condition, which I've found to be reliably accurate; here it reports optimal.
I suspect you have some code mixed in from another model in your example. When I fix a typo or two (you missed the pyo prefix on a few things), it solves fine and gives the correct objective value of 9. I'm not sure where the 50 in your output came from.
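The 9 is easy to verify by hand: the capacity of 50 never binds, so the maximizer sets all four binaries to 1 and the objective is just the sum of the coefficients. A quick check without any solver:

```python
coefs = [1, 1, 3, 4]
x = [1, 1, 1, 1]  # the optimal solution: every binary at its upper bound

obj = sum(c * xi for c, xi in zip(coefs, x))
print(obj)        # 9
print(obj <= 50)  # True - the capacity constraint is slack
```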
(slightly cleaned up) Code:
import pyomo.environ as pyo
from pyomo.core.expr.numeric_expr import LinearExpression

model = pyo.ConcreteModel()
model.nVars = pyo.Param(initialize=4)
model.N = pyo.RangeSet(model.nVars)
model.x = pyo.Var(model.N, within=pyo.Binary)
model.coefs = [1, 1, 3, 4]
model.linexp = LinearExpression(constant=0,
                                linear_coefs=model.coefs,
                                linear_vars=[model.x[i] for i in model.N])

def caprule(m):
    return m.linexp <= 50

model.capme = pyo.Constraint(rule=caprule)
model.obj = pyo.Objective(expr=model.linexp, sense=pyo.maximize)

solver = pyo.SolverFactory('glpk')  # , executable='/usr/bin/glpsol'
results = solver.solve(model)
print(results)
model.obj.pprint()
model.obj.display()
Output:
Problem:
- Name: unknown
  Lower bound: 9.0
  Upper bound: 9.0
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 5
  Number of nonzeros: 5
  Sense: maximize
Solver:
- Status: ok
  Termination condition: optimal
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
  Error rc: 0
  Time: 0.00797891616821289
Solution:
- number of solutions: 0
  number of solutions displayed: 0
obj : Size=1, Index=None, Active=True
    Key  : Active : Sense    : Expression
    None :   True : maximize : x[1] + x[2] + 3*x[3] + 4*x[4]
obj : Size=1, Index=None, Active=True
    Key  : Active : Value
    None :   True :   9.0

How to use Karate continueOnStepFailure to assert each JSON field's value

The goal is to do a soft assertion on each JSON field.
* def KAFKA_TOPIC = "topic1"
* def kafkaExpected = {field1:"value1",field2:"value2",field3:"value3"}
* def kafkaActual = {"topic1":[{field1:"value1",field2:"x",field3:"y"}]}
* configure continueOnStepFailure = { enabled: true, continueAfter: false, keywords: ['match'] }
* match (kafkaActual[KAFKA_TOPIC]) == ['#(kafkaExpected)'] # <-- is there a one-liner like this to do soft assertions on all fields?
* configure continueOnStepFailure = false
Output:
$[0].field2 | not equal (STRING:STRING)
'x'
'value2'
Instead of doing it one by one:
* match (kafkaActual[KAFKA_TOPIC])[0].field1 == kafkaExpected.field1
* match (kafkaActual[KAFKA_TOPIC])[0].field2 == kafkaExpected.field2
* match (kafkaActual[KAFKA_TOPIC])[0].field3 == kafkaExpected.field3
Output:
match failed: EQUALS
$ | not equal (STRING:STRING)
'x'
'value2'
match failed: EQUALS
$ | not equal (STRING:STRING)
'y'
'value3'
And what's weird is that the terminal logs print only one assertion, with either approach.
$ | not equal (STRING:STRING)
'y'
'value3'
I tried karate.forEach, but it doesn't seem like the right path.
Found a solution from the link provided by Peter. I just need to transform the JSON into a list of key/value pairs and use it as a data source.
From:
{field1:"value1",field2:"value2",field3:"value3"}
Transformed To:
[{key:"field1",value:"value1"},{key:"field2",value:"value2"},{key:"field3",value:"value3"}]
Function used and usage:
* def input = INPUT
* def func =
  """
  function(obj) {
    var output = [];
    for (var i in obj) {
      output.push({key: i, value: obj[i]});
    }
    return output;
  }
  """
* json kafkaAttributes = func(input)
* configure continueOnStepFailure = { enabled: true, continueAfter: false, keywords: ['match'] }
* karate.call('kafka.feature#validateFieldsAndValues',kafkaAttributes)
* configure continueOnStepFailure = false
#validateFieldsAndValues
Scenario:
* match (response[KAFKA_TOPIC][0][key]) contains value
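For readers outside Karate, the key/value reshaping done by the JavaScript function above is a one-liner over the dict items. A sketch in Python (the function name is mine):

```python
def to_key_value_list(obj):
    """Turn {'field1': 'value1', ...} into [{'key': ..., 'value': ...}, ...]."""
    return [{"key": k, "value": v} for k, v in obj.items()]

print(to_key_value_list({"field1": "value1", "field2": "value2"}))
# [{'key': 'field1', 'value': 'value1'}, {'key': 'field2', 'value': 'value2'}]
```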

How to use the Wikidata API to access the statements

I'm trying to get information from Wikidata. For example, to access "cobalt-70" I use the API:
import requests

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
query = "cobalt-70"
params = {
    'action': 'wbsearchentities',
    'format': 'json',
    'language': 'en',
    'search': query
}
r = requests.get(API_ENDPOINT, params=params)
print(r.json())
So there is a "claims" key, which gives access to the statements. Is there a good way to check whether a value exists in a statement? For example, "cobalt-70" has the value 0.5 inside the property P2114. So how can I check whether a value exists in the statements of the entity, as in this example? Is there an approach to access it? Thank you!
I'm not sure this is exactly what you are looking for, but if it's close enough, you can probably modify it as necessary:
import requests
import json

url = 'https://www.wikidata.org/wiki/Special:EntityData/Q18844865.json'
req = requests.get(url)
j_dat = req.json()  # parse the JSON response

targets = j_dat['entities']['Q18844865']['claims']['P2114']
for target in targets:
    values = target['mainsnak']['datavalue']['value'].items()
    for value in values:
        print(value[0], value[1])
Output:
amount +0.5
unit http://www.wikidata.org/entity/Q11574
upperBound +0.6799999999999999
lowerBound +0.32
amount +108.0
unit http://www.wikidata.org/entity/Q723733
upperBound +115.0
lowerBound +101.0
EDIT:
To find property id by value, try:
targets = j_dat['entities']['Q18844865']['claims'].items()
for target in targets:
    line = target[1][0]['mainsnak']['datavalue']['value']
    if isinstance(line, dict):
        for v in line.values():
            if v == "+0.5":
                print('property: ', target[0])
Output:
property: P2114
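That loop can be wrapped into a reusable check. A sketch against a hand-built excerpt of the claims structure (the sample dict mimics the shape shown above; the function name is mine):

```python
def properties_with_value(claims, wanted):
    """Return the property ids whose first statement's value dict contains `wanted`."""
    hits = []
    for prop, statements in claims.items():
        value = statements[0]['mainsnak']['datavalue']['value']
        if isinstance(value, dict) and wanted in value.values():
            hits.append(prop)
    return hits

# Minimal hand-built excerpt mimicking j_dat['entities'][...]['claims']
claims = {
    'P2114': [{'mainsnak': {'datavalue': {'value': {
        'amount': '+0.5',
        'unit': 'http://www.wikidata.org/entity/Q11574'}}}}],
}
print(properties_with_value(claims, '+0.5'))  # ['P2114']
```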
I tried a solution that consists of searching inside the JSON object, as proposed here: https://stackoverflow.com/a/55549654/8374738. I hope it can help. Here is the idea.
import pprint

def search(d, search_pattern, prev_datapoint_path=''):
    output = []
    current_datapoint = d
    current_datapoint_path = prev_datapoint_path
    if type(current_datapoint) is dict:
        for dkey in current_datapoint:
            if search_pattern in str(dkey):
                c = current_datapoint_path
                c += "['" + dkey + "']"
                output.append(c)
            c = current_datapoint_path
            c += "['" + dkey + "']"
            for i in search(current_datapoint[dkey], search_pattern, c):
                output.append(i)
    elif type(current_datapoint) is list:
        for i in range(0, len(current_datapoint)):
            if search_pattern in str(i):
                c = current_datapoint_path
                c += "[" + str(i) + "]"
                output.append(c)  # append the path, not the index
            c = current_datapoint_path
            c += "[" + str(i) + "]"
            for i in search(current_datapoint[i], search_pattern, c):
                output.append(i)
    elif search_pattern in str(current_datapoint):
        c = current_datapoint_path
        output.append(c)
    output = filter(None, output)
    return list(output)
And you just need to use:
pprint.pprint(search(res.json(),'0.5','res.json()'))
Output:
["res.json()['claims']['P2114'][0]['mainsnak']['datavalue']['value']['amount']"]

Pandas DataFrame KeyError: 1

I am a beginner, so I can't figure out the reason for the error in the following code, when train.jsonl uses a format like this:
{"claim": "But he said if people really want to know if they have CHIP they can get a blood test that costs a few MONEYc1", "evidence": "sentenceID100037", "label": "0"}
{"claim": "This is rather a courtly formulation and would doubtless trigger further eyerolling if uttered in", "evidence": "sentenceID100038", "label": "0"}
The top part executes without problems and displays the data:
import pandas as pd
prefix = '/content/'
train_df = pd.read_json(prefix + 'train.jsonl', orient='records', lines=True)
train_df.head()
[See my Colab Notebook](https://colab.research.google.com/gist/lenyabloko/0e17ebe0f3a0e808779bc1fa95e9b24d/semeval2020-delex.ipynb)
I even tried this additional trick, which led to the comments about the 0 column:
prefix = '/content/'
train_df = pd.read_json(prefix + 'train_delex.jsonl', orient='columns')
train_df.to_csv(prefix+'train.tsv', sep='\t', index=False, header=False)
train_df = pd.read_csv(prefix + 'train.tsv', header=None)
train_df.head()
Now I see a single column labeled 0 instead of the original three columns ({"claim": ..., "evidence": ..., "label": ...}) from the JSONL file above. Why is that?
But when I add the DataFrame code, it results in an error:
train_df = pd.DataFrame({
    'id': train_df[1],
    'text': train_df[0],
    'labels': train_df[2]
})
In light of the column named 0, this wouldn't work. But where did that column come from?
KeyError Traceback (most recent call last)
2 frames
<ipython-input-16-0537eda6b397> in <module>()
6
7 train_df = pd.DataFrame({
----> 8 'id': train_df[1],
9 'text': train_df[0],
10 'labels':train_df[2]
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2993 if self.columns.nlevels > 1:
2994 return self._getitem_multilevel(key)
-> 2995 indexer = self.columns.get_loc(key)
2996 if is_integer(indexer):
2997 indexer = [indexer]
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
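The underlying cause of the KeyError is that read_json with lines=True names the columns after the JSON keys, so there is no integer column 1 to look up. A small self-contained sketch (two in-memory JSONL lines standing in for the original file):

```python
import pandas as pd
from io import StringIO

jsonl = ('{"claim": "a", "evidence": "sentenceID100037", "label": "0"}\n'
         '{"claim": "b", "evidence": "sentenceID100038", "label": "0"}\n')
df = pd.read_json(StringIO(jsonl), orient='records', lines=True)
print(list(df.columns))  # ['claim', 'evidence', 'label'] - named, not numbered

# Select by name instead of position, so df[1] (and its KeyError) never comes up
out = pd.DataFrame({'id': df['evidence'], 'text': df['claim'], 'labels': df['label']})
```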
Here is the solution that worked for me:
import pandas as pd
prefix = '/content/'
test_df = pd.read_json(prefix + 'test_delex.jsonl', orient='records', lines=True)
test_df.rename(columns={'claim': 'text', 'evidence': 'id', 'label':'labels'}, inplace=True)
cols = test_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
cols = cols[-1:] + cols[:-1]
test_df = test_df[cols]
test_df.to_csv(prefix+'test.csv', sep=',', index=False, header=False)
test_df.head()
I updated my shared Colab Notebook linked in the question above

Why is the output not logically right?

I'm trying to implement this code. Here are the problems:
1. I want to use the variable typ_transport in further code, not just inside the if (it does not recognize the variable).
2. The logic seems right, but when I change the values in jsonStr (e.g. "walk" : true to "walk" : false) it does not print the right output.
Could anyone help me with this? Thanks.
import json

jsonStr = '{"walk" : true, "bike" : false, "tram" : false}'
inputvar = json.loads(jsonStr)

if inputvar['walk'] == 'True' and inputvar['bike'] == 'False':
    typ_transport = 'foot'
elif inputvar['walk'] == 'False' and inputvar['bike'] == 'True':
    typ_transport = 'bicycle'

class transport:
    if typ_transport == 'foot':
        velocity = 80
        typ = 'foot'
    elif typ_transport == 'bicycle':
        velocity = 330
        typ = 'bicycle'

    def __init__(self, typ, velocity):
        self.velocity = velocity
        self.typ = typ

if inputvar['tram'] == 'False':
    radius = T * transport.velocity
    print(radius)
else:
    print(typ_transport, 333)
I see some problems in the code, so I'll try to point them out as I go through your questions
i want to use variable typ_transport in further coding not just inside if.(it does not recognize the variable.)
The reason you can't access the typ_transport variable is that it's created inside the if statement. If you wish to access the variable later in the code, you have to move typ_transport into the enclosing (global) scope.
You can do this in two ways. The first is creating the variable before the if statement:
typ_transport = ""
if inputvar['walk'] == True and inputvar['bike'] == False:
    typ_transport = 'foot'
The second way is to declare a global variable inside the if statement using the global keyword. This way is highly discouraged, since it makes it easy to lose track of variables and their scopes.
the logic seems to be right but when i change the values in jsonStr (e.g. "walk" : true to "walk" : false) the output does not print the right output
Aside from the spelling errors, Python booleans are written True and False (no quotes, first letter capitalized). The json module parses JSON true/false into these Python booleans, so compare against True and False, not the strings 'True' and 'False' - the string comparisons in your ifs can never match.
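A quick sanity check of that point, using only the standard library:

```python
import json

parsed = json.loads('{"walk": true, "bike": false}')
print(type(parsed['walk']))      # <class 'bool'>
print(parsed['walk'] is True)    # True  - a real Python bool
print(parsed['walk'] == 'True')  # False - which is why the original ifs never fire
```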
Lastly, you are using a class, but it's not organized. Let's make it look a little tidier.
class transport:
    def __init__(self, typ_transport):  # utilizing the global variable we created
        if typ_transport == 'foot':
            velocity = 80
            self.typ_transport = typ_transport
            self.velocity = velocity
        elif typ_transport == 'bicycle':
            ......
Now, to get the velocity when typ_transport == 'foot':
passenger = transport(typ_transport)  # creating a passenger object
velocity = passenger.velocity
import json

jsonStr = ('[{"walk" : true, "bike" : false, "tram" : false}, '
           '{"walk" : false, "bike" : true, "tram" : false}, '
           '{"walk" : false, "bike" : false, "tram" : true}, '
           '{"walk" : false, "bike" : false, "tram" : false}]')
inputvar = json.loads(jsonStr)

class Transport:
    def __init__(self, typ):
        self.typ = typ
        self.velocity = None
        if typ == 'foot':
            self.velocity = 80
        elif typ == 'bicycle':
            self.velocity = 330
        elif typ == 'tram':
            self.velocity = 333

for var in inputvar:
    typ_transport = None
    if var['walk'] is True and var['bike'] is False:
        typ_transport = 'foot'
    elif var['walk'] is False and var['bike'] is True:
        typ_transport = 'bicycle'
    elif var['tram'] is True:
        typ_transport = 'tram'
    transport = Transport(typ_transport)
    print(transport.typ, transport.velocity)
This makes more sense to me. Feel free to change it if I misunderstood your logic.