My question is related to https://learn.microsoft.com/en-us/answers/questions/217305/data-input-format-call-the-service-for-azure-ml-ti.html; however, the solution provided there does not seem to work.
I am building a simple model on the heart-disease dataset, wrapped in a Pipeline because I use some featurization steps (scaling, encoding, etc.). The full script is below:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import pickle
# data input
df = pd.read_csv('heart.csv')
# numerical variables
num_cols = ['age',
'trestbps',
'chol',
'thalach',
'oldpeak'
]
# categorical variables
cat_cols = ['sex',
'cp',
'fbs',
'restecg',
'exang',
'slope',
'ca',
'thal']
# changing format of the categorical variables
df[cat_cols] = df[cat_cols].apply(lambda x: x.astype('object'))
# target variable
y = df['target']
# features
X = df.drop(['target'], axis=1)
# data split:
# random seed
np.random.seed(42)
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
stratify=y)
# double check
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# pipeline for numerical data
num_preprocessing = Pipeline([('num_imputer', SimpleImputer(strategy='mean')), # imputing with mean
('minmaxscaler', MinMaxScaler())]) # scaling
# pipeline for categorical data
cat_preprocessing = Pipeline([('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')), # filling missing values
('onehot', OneHotEncoder(drop='first', handle_unknown='error'))]) # One Hot Encoding
# preprocessor - combining pipelines
preprocessor = ColumnTransformer([
('categorical', cat_preprocessing, cat_cols),
('numerical', num_preprocessing, num_cols)
])
# initial model parameters
log_ini_params = {'penalty': 'l2',
'tol': 0.0073559740277086005,
'C': 1.1592424247511928,
'fit_intercept': True,
'solver': 'liblinear'}
# model - Pipeline
log_clf = Pipeline([('preprocessor', preprocessor),
('clf', LogisticRegression(**log_ini_params))])
log_clf.fit(X_train, y_train)
# dumping the model
f = 'model/log.pkl'
with open(f, 'wb') as file:
    pickle.dump(log_clf, file)
# loading it
loaded_model = joblib.load(f)
# double check on a single datapoint
new_data = pd.DataFrame({'age': 71,
'sex': 0,
'cp': 0,
'trestbps': 112,
'chol': 203,
'fbs': 0,
'restecg': 1,
'thalach': 185,
'exang': 0,
'oldpeak': 0.1,
'slope': 2,
'ca': 0,
'thal': 2}, index=[0])
loaded_model.predict(new_data)
...and it works just fine. Then I deploy the model to the Azure Web Service using these steps:
I create the score.py file:
import joblib
from azureml.core.model import Model
import json
def init():
    global model
    model_path = Model.get_model_path('log')  # logistic
    print('Model Path is ', model_path)
    model = joblib.load(model_path)

def run(data):
    try:
        data = json.loads(data)
        result = model.predict(data['data'])
        # any data type, as long as it is JSON serializable.
        return {'data': result.tolist(), 'message': 'Successfully classified heart diseases'}
    except Exception as e:
        error = str(e)
        return {'data': error, 'message': 'Failed to classify heart diseases'}
I deploy the model:
from azureml.core import Workspace
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.conda_dependencies import CondaDependencies
ws = Workspace.from_config()
model = Model.register(workspace = ws,
model_path ='model/log.pkl',
model_name = 'log',
tags = {'version': '1'},
description = 'Heart disease classification',
)
# to install required packages
env = Environment('env')
cd = CondaDependencies.create(pip_packages=['pandas==1.1.5', 'azureml-defaults','joblib==0.17.0'], conda_packages = ['scikit-learn==0.23.2'])
env.python.conda_dependencies = cd
# Register environment to re-use later
env.register(workspace = ws)
print('Registered Environment')
myenv = Environment.get(workspace=ws, name='env')
myenv.save_to_directory('./environ', overwrite=True)
aciconfig = AciWebservice.deploy_configuration(
cpu_cores=1,
memory_gb=1,
tags={'data':'heart disease classifier'},
description='Classification of heart diseases',
)
inference_config = InferenceConfig(entry_script='score.py', environment=myenv)
service = Model.deploy(workspace=ws,
name='hd-model-log',
models=[model],
inference_config=inference_config,
deployment_config=aciconfig,
overwrite = True)
service.wait_for_deployment(show_output=True)
url = service.scoring_uri
print(url)
The deployment is fine:
Succeeded
ACI service creation operation finished, operation "Succeeded"
But I cannot make any predictions on new data. I try to use:
import pandas as pd
new_data = pd.DataFrame([[71, 0, 0, 112, 203, 0, 1, 185, 0, 0.1, 2, 0, 2],
[80, 0, 0, 115, 203, 0, 1, 185, 0, 0.1, 2, 0, 0]],
columns=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'])
Following the answer from this topic (https://learn.microsoft.com/en-us/answers/questions/217305/data-input-format-call-the-service-for-azure-ml-ti.html) I transform the data:
test_sample = json.dumps({'data': new_data.to_dict(orient='records')})
And try to make some predictions:
import json
import requests
data = test_sample
headers = {'Content-Type':'application/json'}
r = requests.post(url, data=data, headers = headers)
print(r.status_code)
print(r.json())
However, I encounter an error:
200
{'data': "Expected 2D array, got 1D array instead:\narray=[{'age': 71, 'sex': 0, 'cp': 0, 'trestbps': 112, 'chol': 203, 'fbs': 0, 'restecg': 1, 'thalach': 185, 'exang': 0, 'oldpeak': 0.1, 'slope': 2, 'ca': 0, 'thal': 2}\n {'age': 80, 'sex': 0, 'cp': 0, 'trestbps': 115, 'chol': 203, 'fbs': 0, 'restecg': 1, 'thalach': 185, 'exang': 0, 'oldpeak': 0.1, 'slope': 2, 'ca': 0, 'thal': 0}].\nReshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.", 'message': 'Failed to classify heart diseases'}
How can I adjust the input data so that the service can make predictions, and how can I add other outputs such as predict_proba so that I can store them in a separate output dataset?
I know this error is related either to the "run" part of the score.py file or to the last code cell that calls the web service, but I'm unable to find the cause.
I would really appreciate some help.
I believe I managed to solve the problem - even though I encountered some serious issues. :)
As described here, I edited the score.py script:
import joblib
from azureml.core.model import Model
import numpy as np
import json
import pandas as pd
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType
data_sample = PandasParameterType(pd.DataFrame({'age': pd.Series([0], dtype='int64'),
'sex': pd.Series(['example_value'], dtype='object'),
'cp': pd.Series(['example_value'], dtype='object'),
'trestbps': pd.Series([0], dtype='int64'),
'chol': pd.Series([0], dtype='int64'),
'fbs': pd.Series(['example_value'], dtype='object'),
'restecg': pd.Series(['example_value'], dtype='object'),
'thalach': pd.Series([0], dtype='int64'),
'exang': pd.Series(['example_value'], dtype='object'),
'oldpeak': pd.Series([0.0], dtype='float64'),
'slope': pd.Series(['example_value'], dtype='object'),
'ca': pd.Series(['example_value'], dtype='object'),
'thal': pd.Series(['example_value'], dtype='object')}))
input_sample = StandardPythonParameterType({'data': data_sample})
result_sample = NumpyParameterType(np.array([0]))
output_sample = StandardPythonParameterType({'Results':result_sample})
def init():
    global model
    # Example when the model is a file
    model_path = Model.get_model_path('log')  # logistic
    print('Model Path is ', model_path)
    model = joblib.load(model_path)

@input_schema('Inputs', input_sample)
@output_schema(output_sample)
def run(Inputs):
    try:
        data = Inputs['data']
        result = model.predict_proba(data)
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
In the deployment step I adjusted the CondaDependencies:
# to install required packages
env = Environment('env')
cd = CondaDependencies.create(pip_packages=['pandas==1.1.5', 'azureml-defaults','joblib==0.17.0', 'inference-schema==1.3.0'], conda_packages = ['scikit-learn==0.22.2.post1'])
env.python.conda_dependencies = cd
# Register environment to re-use later
env.register(workspace = ws)
print('Registered Environment')
because:
a) inference-schema must be included in the dependencies;
b) I downgraded scikit-learn to version 0.22.2.post1 because of this issue.
Now, when I feed the model with new data:
new_data = {
"Inputs": {
"data": [
{
"age": 71,
"sex": "0",
"cp": "0",
"trestbps": 112,
"chol": 203,
"fbs": "0",
"restecg": "1",
"thalach": 185,
"exang": "0",
"oldpeak": 0.1,
"slope": "2",
"ca": "0",
"thal": "2"
}
]
}
}
And use it for prediction:
import json
import requests
data = new_data
headers = {'Content-Type':'application/json'}
r = requests.post(url, str.encode(json.dumps(data)), headers = headers)
print(r.status_code)
print(r.json())
I get:
200 [[0.02325369841858338, 0.9767463015814166]]
Uff! Maybe someone will benefit from my painful learning path! :)
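For readers who would rather keep the payload shape from the question ({'data': [records]}), here is a minimal sketch of the conversion step. It is not the approach used above: it assumes the request body is a JSON string of row dictionaries, and the helper name records_to_frame is hypothetical. The idea is that json.loads returns a list of dicts, which the error message shows being read as a 1D array, whereas a DataFrame gives the pipeline the 2D, named-column input it was trained on.

```python
import json
import pandas as pd

# Column order and categorical columns copied from the training script.
FEATURE_COLUMNS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                   'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
CATEGORICAL_COLUMNS = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']


def records_to_frame(raw_json):
    """Hypothetical helper: turn '{"data": [{...}, ...]}' into a DataFrame."""
    records = json.loads(raw_json)['data']
    # Fix the column order so it matches what the pipeline saw during fit.
    frame = pd.DataFrame(records, columns=FEATURE_COLUMNS)
    # Mirror the training-time cast of the categorical columns to object dtype.
    return frame.astype({col: 'object' for col in CATEGORICAL_COLUMNS})


sample = json.dumps({'data': [
    {'age': 71, 'sex': 0, 'cp': 0, 'trestbps': 112, 'chol': 203, 'fbs': 0,
     'restecg': 1, 'thalach': 185, 'exang': 0, 'oldpeak': 0.1, 'slope': 2,
     'ca': 0, 'thal': 2},
    {'age': 80, 'sex': 0, 'cp': 0, 'trestbps': 115, 'chol': 203, 'fbs': 0,
     'restecg': 1, 'thalach': 185, 'exang': 0, 'oldpeak': 0.1, 'slope': 2,
     'ca': 0, 'thal': 0},
]})
frame = records_to_frame(sample)
print(frame.shape)
```

Inside run(), both model.predict(frame) and model.predict_proba(frame) could then be called and returned together after .tolist(), which would cover the predict_proba output asked about in the question.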
The main issue is the conversion of the categorical variables. The traditional way to handle categorical variables is to use OneHotEncoder. In the question, they are first cast to object:
# changing format of the categorical variables
df[cat_cols] = df[cat_cols].apply(lambda x: x.astype('object'))
The data needs to be transformed as shown below:
from sklearn.preprocessing import MinMaxScaler
cat_col =['sex',
'cp',
'fbs',
'restecg',
'exang',
'slope',
'ca',
'thal']
df_2 = pd.get_dummies(df[cat_col], drop_first=True)
After applying get_dummies, the categorical columns become 0/1 indicator columns. Then:
new_data = pd.DataFrame([[71, 0, 0, 112, 203, 0, 1, 185, 0, 0.1, 2, 0, 2],
[80, 0, 0, 115, 203, 0, 1, 185, 0, 0.1, 2, 0, 0]],
columns=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'])
This can be applied with only a few changes to the syntax.
Edit:
new_data = {
"Inputs": {
"data": [
{
"age": 71,
"sex": "0",
"cp": "0",
"trestbps": 112,
"chol": 203,
"fbs": "0",
"restecg": "1",
"thalach": 185,
"exang": "0",
"oldpeak": 0.1,
"slope": "2",
"ca": "0",
"thal": "2"
}
]
}
}
I am using the SMT solver Z3 to generate a new set of random numbers from a given vector under some constraints, in order to hide my input stream. The corresponding code is below:
from z3 import *
import sys
import io
import math
X0 = Real('X0')
X1 = Real('X1')
X2 = Real('X2')
X3 = Real('X3')
X4 = Real('X4')
X5 = Real('X5')
X6 = Real('X6')
X7 = Real('X7')
X8 = Real('X8')
X9 = Real('X9')
X10 = Real('X10')
X11 = Real('X11')
X12 = Real('X12')
X13 = Real('X13')
X14 = Real('X14')
DistinctParameter = [Distinct(X0 , X1 , X2 , X3 , X4 , X5 , X6 , X7 , X8 , X9 , X10 , X11 , X12 , X13 , X14 )]
# InputStream (the list of input values) is assumed to be defined before this point
maxPossibleValue = max(InputStream)
AggregateValue = 0
for x in InputStream:
    AggregateValue = AggregateValue + float(x)
S_Con_Comparison1 = [(X0 < maxPossibleValue)]
S_Con_Comparison2 = [(X1 < maxPossibleValue)]
S_Con_Comparison3 = [(X2 < maxPossibleValue)]
S_Con_Comparison4 = [(X3 < maxPossibleValue)]
S_Con_Comparison5 = [(X4 < maxPossibleValue)]
S_Con_Comparison6 = [(X5 < maxPossibleValue)]
S_Con_Comparison7 = [(X6 < maxPossibleValue)]
S_Con_Comparison8 = [(X7 < maxPossibleValue)]
S_Con_Comparison9 = [(X8 < maxPossibleValue)]
S_Con_Comparison10 = [(X9 < maxPossibleValue)]
S_Con_Comparison11 = [(X10 < maxPossibleValue)]
S_Con_Comparison12 = [(X11 < maxPossibleValue)]
S_Con_Comparison13 = [(X12 < maxPossibleValue)]
S_Con_Comparison14 = [(X13 < maxPossibleValue)]
S_Con_Comparison15 = [(X14 < maxPossibleValue)]
S_Con_Comparison = S_Con_Comparison1 + S_Con_Comparison2 + S_Con_Comparison3 + S_Con_Comparison4 + S_Con_Comparison5 + S_Con_Comparison6 + S_Con_Comparison7 + S_Con_Comparison8 + S_Con_Comparison9 + S_Con_Comparison10 + S_Con_Comparison11 + S_Con_Comparison12 + S_Con_Comparison13 + S_Con_Comparison14 + S_Con_Comparison15
S_Con = [( X0 + X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 == AggregateValue)]
Solve = S_Con + S_Con_Comparison + DistinctParameter
s = Solver()
s.add(Solve)
x = Reals('x')
i = 0
output =[0] * len(InputStream)
if s.check() == sat:
    m = s.model()
    for d in m.decls():
        location = int((repr(d).replace("X","")))
        x = round(float(m[d].numerator_as_long())/float(m[d].denominator_as_long()),5)
        output[location] = x
print(output)
Each value of the input stream can be taken from a set of 2^25 possible values. As I understand it, the only way to find the input stream is to brute-force the resulting stream. Given these circumstances, I want to know whether it is possible to reverse engineer the input stream from the corresponding output stream.
As mentioned in the comments, SMT solvers should not be entrusted with the task of generating truly random models. Having said this, it doesn't look like you need such a property to be guaranteed for your application.
I fixed your model so as to impose X_i >= 0, since this is a requirement in the comments.
from z3 import *
import sys
import io
import math
def obfuscate(input_stream):
    X_list = [Real('X_{0}'.format(idx)) for idx in range(0, len(input_stream))]
    assert len(X_list) == len(input_stream)
    max_input_value = max(input_stream)
    aggregate_value = sum(input_stream)
    distinct_cs = Distinct(X_list)
    lower_cs = [(0 <= Xi) for Xi in X_list]
    upper_cs = [(Xi < max_input_value) for Xi in X_list]
    same_sum_cs = (Sum(X_list) == aggregate_value)
    s = Solver()
    s.add(distinct_cs)
    s.add(lower_cs)
    s.add(upper_cs)
    s.add(same_sum_cs)
    status = s.check()
    if status == sat:
        r_ret = []
        fp_ret = []
        m = s.model()
        for Xi in X_list:
            r_value = m.eval(Xi)
            r_ret.append(r_value)
            num = r_value.numerator_as_long()
            den = r_value.denominator_as_long()
            fp_value = round(float(num) / float(den), 5)
            fp_ret.append(fp_value)
        return input_stream, aggregate_value, "sat", r_ret, fp_ret, sum(fp_ret)
    else:
        return input_stream, aggregate_value, "unsat", None, None, None

if __name__ == '__main__':
    print("Same-value inputs are all unsat")
    print(obfuscate([0.0, 0.0, 0.0]))
    print(obfuscate([1.0, 1.0, 1.0]))
    print(obfuscate([2.0, 2.0, 2.0]))
    print("\nRe-ordering input does not change output")
    print(obfuscate([1.0, 2.0, 3.0]))
    print(obfuscate([1.0, 3.0, 2.0]))
    print(obfuscate([3.0, 2.0, 1.0]))
    print("")
    print(obfuscate([0.1, 0.0, 0.0]))
    print(obfuscate([0.0, 0.1, 0.0]))
    print(obfuscate([0.0, 0.0, 0.1]))
    print("\nSame-sum input do not necessarily map to the same outputs")
    print(obfuscate([0.1, 0.9, 2.0]))
    print(obfuscate([1.1, 0.1, 1.8]))
    print("\nSame outputs may result from different inputs")
    print(obfuscate([0.6, 1.3, 1.1]))
    print(obfuscate([1.3, 0.7, 1.0]))
The output is:
Same-value inputs are all unsat
([0.0, 0.0, 0.0], 0.0, 'unsat', None, None, None)
([1.0, 1.0, 1.0], 3.0, 'unsat', None, None, None)
([2.0, 2.0, 2.0], 6.0, 'unsat', None, None, None)
Re-ordering input does not change output
([1.0, 2.0, 3.0], 6.0, 'sat', [5/2, 11/4, 3/4], [2.5, 2.75, 0.75], 6.0)
([1.0, 3.0, 2.0], 6.0, 'sat', [5/2, 11/4, 3/4], [2.5, 2.75, 0.75], 6.0)
([3.0, 2.0, 1.0], 6.0, 'sat', [5/2, 11/4, 3/4], [2.5, 2.75, 0.75], 6.0)
([0.1, 0.0, 0.0], 0.1, 'sat', [1/30, 1/15, 0], [0.03333, 0.06667, 0.0], 0.09999999999999999)
([0.0, 0.1, 0.0], 0.1, 'sat', [1/30, 1/15, 0], [0.03333, 0.06667, 0.0], 0.09999999999999999)
([0.0, 0.0, 0.1], 0.1, 'sat', [1/30, 1/15, 0], [0.03333, 0.06667, 0.0], 0.09999999999999999)
Same-sum input do not necessarily map to the same outputs
([0.1, 0.9, 2.0], 3.0, 'sat', [4/3, 5/3, 0], [1.33333, 1.66667, 0.0], 3.0)
([1.1, 0.1, 1.8], 3.0, 'sat', [7/5, 8/5, 0], [1.4, 1.6, 0.0], 3.0)
Same outputs may result from different inputs
([0.6, 1.3, 1.1], 3.0, 'sat', [23/20, 49/40, 5/8], [1.15, 1.225, 0.625], 3.0)
([1.3, 0.7, 1.0], 3.0, 'sat', [23/20, 49/40, 5/8], [1.15, 1.225, 0.625], 3.0)
This simple example allows us to make the following observations:
the output is determined by the values in input, but it is not affected by their order
the obfuscation procedure can be sensitive to variations in the input stream
Therefore, even if an attacker attempts to use rainbow tables to find the potential input multiset that generated an output sequence, they still cannot find the exact order of the values in the input stream.
Let's disregard the fact that building such rainbow tables is impractical due to the large number of input sequences of size 15 that can be generated with a pool of 2^25 candidate values (a loose upper bound would be 2^375), and assume that we have a way to access one efficiently.
Given an output sequence O, generated with obfuscate(), we can look for a match M inside our rainbow table, where M is a list of multisets that, when used as input, would result in the same output O. Let M[i] be the i-th input multiset in M, containing n elements in total, where the j-th of its k distinct values has multiplicity m_j. Then the number of possible permutations of M[i] is (source: Wikipedia):
$$\frac{n!}{m_1! \, m_2! \cdots m_k!}$$
In the simplest scenario, in which every value in the input stream is different from the others, there are up to 15! = 1,307,674,368,000 permutations for each candidate solution M[i] in the match M. In your application, would the attacker have the time to try all of them?
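As a small illustration of the multiset-permutation count n! / (m_1! * ... * m_k!), here is a sketch in plain Python. The function name multiset_permutations is made up for this example; it only counts orderings and does not enumerate them.

```python
from collections import Counter
from math import factorial


def multiset_permutations(values):
    """Number of distinct orderings of `values`: n! / (m_1! * ... * m_k!)."""
    count = factorial(len(values))
    # Divide by the factorial of each distinct value's multiplicity.
    for multiplicity in Counter(values).values():
        count //= factorial(multiplicity)
    return count


# 15 pairwise distinct values: every multiplicity is 1, so the count is 15!.
print(multiset_permutations(range(15)))
# One repeated value: 3! / 2! orderings.
print(multiset_permutations([1.0, 1.0, 2.0]))
```

Repeated values in the input stream shrink the number of orderings an attacker would have to try, so 15! is the worst case for the attacker, not a guarantee.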
I'm attempting to convert old code to PyTorch code as an experiment. Ultimately, I will be doing regression on a 10,000+ x 100 Matrix, updating weights and whatnot appropriately.
Trying to learn, I'm slowly scaling up on toy examples. I'm hitting a wall with the following sample code.
import torch
import torch.nn as nn
import torch.nn.functional as funct
from torch.autograd import Variable
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_data = Variable( torch.Tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ] ),
requires_grad=True )
y_data = Variable( torch.Tensor( [ [2.0], [4.0], [6.0] ] ) )
w = Variable( torch.randn( 2, 1, requires_grad=True ) )
b = Variable( torch.randn( 1, 1, requires_grad=True ) )
class Model(torch.nn.Module) :
    def __init__(self) :
        super( Model, self).__init__()
        self.linear = torch.nn.Linear(2,1) ## 2 features per entry. 1 output
    def forward(self, x2, w2, b2) :
        y_pred = x2 @ w2 + b2
        return y_pred
model = Model()
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD( model.parameters(), lr=0.01 )
for epoch in range(10) :
    y_pred = model( x_data,w,b ) # Get prediction
    loss = criterion( y_pred, y_data ) # Calc loss
    print( epoch, loss.data.item() ) # Print loss
    optimizer.zero_grad() # Zero gradient
    loss.backward() # Calculate gradients
    optimizer.step() # Update w, b
However, doing so, my loss is always the same, and investigating shows my w and b never actually change. I'm a bit lost at what's going on here.
Ultimately, I'd like to be able to store the results of the "new" w and b to compare across iterations and datasets.
It looks like a case of cargo cult programming to me.
Notice that your Model class doesn't make use of self in forward, so it is effectively a "regular" (non-method) function, and model is entirely stateless. The optimizer is built from model.parameters(), which only contains the unused self.linear, so it never sees w and b. The simplest fix to your code is to make the optimizer aware of w and b, by creating it as optimizer = torch.optim.SGD([w, b], lr=0.01). I also rewrote model as a plain function:
import torch
import torch.nn as nn
# torch.autograd.Variable is roughly equivalent to requires_grad=True
# and is deprecated in PyTorch 1.0
# your code gives no reason to have `requires_grad=True` on `x_data`
x_data = torch.tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ])
y_data = torch.tensor( [ [2.0], [4.0], [6.0] ] )
w = torch.randn( 2, 1, requires_grad=True )
b = torch.randn( 1, 1, requires_grad=True )
def model(x2, w2, b2):
    return x2 @ w2 + b2
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD([w, b], lr=0.01 )
for epoch in range(10) :
    y_pred = model( x_data,w,b )
    loss = criterion( y_pred, y_data )
    print( epoch, loss.data.item() )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
That being said, nn.Linear is built to simplify this procedure. It automatically creates an equivalent of both w and b, called self.weight and self.bias, respectively. Also, self.__call__(x) is equivalent to the forward of your Model, in that it returns x @ self.weight.T + self.bias (the weight is stored with shape (out_features, in_features), hence the transpose). In other words, you can also use this alternative code:
import torch
import torch.nn as nn
x_data = torch.tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ] )
y_data = torch.tensor( [ [2.0], [4.0], [6.0] ] )
model = nn.Linear(2, 1)
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 )
for epoch in range(10) :
    y_pred = model(x_data)
    loss = criterion( y_pred, y_data )
    print( epoch, loss.data.item() )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
where model.parameters() can be used to enumerate model parameters (equivalent to the manually created list [w, b] above). To access your parameters (load, save, print, whatever) use model.weight and model.bias.
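To keep the "new" w and b from every iteration for later comparison, one possible sketch is to snapshot model.weight and model.bias after each optimizer step. The history list, the fixed seed, and the dictionary layout are assumptions made for this example; reduction='sum' is used as the current spelling of size_average=False.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed so separate runs are comparable

x_data = torch.tensor([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
y_data = torch.tensor([[2.0], [4.0], [6.0]])

model = nn.Linear(2, 1)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

history = []  # one snapshot per epoch
for epoch in range(10):
    loss = criterion(model(x_data), y_data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    history.append({
        'epoch': epoch,
        'loss': loss.item(),
        # detach().clone() copies the current values; storing model.weight
        # itself would make every entry point at the same live tensor.
        'weight': model.weight.detach().clone(),
        'bias': model.bias.detach().clone(),
    })

print(history[0]['weight'], history[-1]['weight'])
```

The same list could be built once per dataset and the snapshots compared afterwards, for example with torch.allclose.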
As a beginner in machine learning, I want to write a precision_recall function that computes precision and recall. However, the function has to take a third parameter, and I do not know how to use it. How do I fix the following code?
def precision_recall(y_true, y_pred, third):
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)
In other words, how should I change the code so that the scores are computed for a given class taken from the arrays?
You can do something like this:
import numpy as np
from sklearn.metrics import precision_score, recall_score
def precision_recall(y_true, y_pred, scalar):
    class_true = (y_true == scalar)
    class_pred = (y_pred == scalar)
    return precision_score(class_true, class_pred), recall_score(class_true, class_pred)
true = np.array(['red', 'green', 'blue', 'red', 'green'])
pred = np.array(['red', 'green', 'red', 'red', 'red'])
print(precision_recall(true, pred, 'red'))
print(precision_recall(true, pred, 'green'))
Output:
(0.5, 1.0)
(1.0, 0.5)
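To make explicit what the one-vs-rest comparison above computes, here is a hand-counted sketch without scikit-learn. The name precision_recall_manual is made up for this example, and returning 0.0 when a denominator is zero is an assumption of the sketch.

```python
def precision_recall_manual(y_true, y_pred, label):
    """One-vs-rest precision and recall for `label`, counted by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    # Assumption: report 0.0 when the class is never predicted / never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


true = ['red', 'green', 'blue', 'red', 'green']
pred = ['red', 'green', 'red', 'red', 'red']
print(precision_recall_manual(true, pred, 'red'))    # (0.5, 1.0)
print(precision_recall_manual(true, pred, 'green'))  # (1.0, 0.5)
```

For 'red', four samples are predicted red but only two are truly red, so precision is 2/4, while both true reds are found, so recall is 2/2.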
I am using statsmodels OLS to fit a series of points to a line:
import statsmodels.api as sm
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15]
X = [[73.759999999999991], [73.844999999999999], [73.560000000000002],
[73.209999999999994], [72.944999999999993], [73.430000000000007],
[72.950000000000003], [73.219999999999999], [72.609999999999999],
[74.840000000000003], [73.079999999999998], [74.125], [74.75],
[74.760000000000005]]
ols = sm.OLS(Y, X)
r = ols.fit()
preds = r.predict()
print(preds)
And I get the following results:
[ 7.88819844 7.89728869 7.86680961 7.82937917 7.80103898 7.85290687
7.8015737 7.83044861 7.76521269 8.00369809 7.81547643 7.92723304
7.99407312 7.99514256]
These are off by about a factor of 10. What am I doing wrong? I tried adding a constant, but that just makes the values 1000 times bigger. I don't know much about statistics, so maybe there is something I need to do with the data?
I think you have switched your response and your predictor, like Michael Mayer suggested in his comment. If you plot the data with predictions from your model, you get something like this:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([1,2,3,4,5,6,7,8,9,11,12,13,14,15])
X = np.array([ 73.76 , 73.845, 73.56 , 73.21 , 72.945, 73.43 , 72.95 ,
73.22 , 72.61 , 74.84 , 73.08 , 74.125, 74.75 , 74.76 ])
Design = np.column_stack((np.ones(14), X))
ols = sm.OLS(Y, Design).fit()
preds = ols.predict()
plt.plot(X, Y, 'ko')
plt.plot(X, preds, 'k-')
plt.show()
If you switch X and Y, which is what I think you want, you get:
Design2 = np.column_stack((np.ones(14), Y))
ols2 = sm.OLS(X, Design2).fit()
preds2 = ols2.predict()
print(preds2)
[ 73.1386399 73.21305699 73.28747409 73.36189119 73.43630829
73.51072539 73.58514249 73.65955959 73.73397668 73.88281088
73.95722798 74.03164508 74.10606218 74.18047927]
plt.plot(Y, X, 'ko')
plt.plot(Y, preds2, 'k-')
plt.show()
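For intuition about where the two sets of numbers come from, here is a sketch of the closed-form least-squares fits in plain Python, without statsmodels. The function names are made up for this example. The first fit has no intercept, which is what sm.OLS(Y, X) does when no constant column is supplied; the second regresses X on Y with an intercept, as in the switched model above.

```python
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15]
X = [73.76, 73.845, 73.56, 73.21, 72.945, 73.43, 72.95,
     73.22, 72.61, 74.84, 73.08, 74.125, 74.75, 74.76]


def fit_through_origin(x, y):
    """Least-squares slope for y ~ b * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_with_intercept(x, y):
    """Least-squares (intercept, slope) for y ~ a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Original call: Y regressed on X with no constant, so the line is forced
# through the origin and every prediction is just slope * x.
slope_only = fit_through_origin(X, Y)
print(slope_only * X[0])

# Switched model: X regressed on Y, with an intercept.
intercept, slope = fit_with_intercept(Y, X)
print(intercept + slope * Y[0])
```

With X clustered around 73 and a line forced through the origin, all predictions land near the mean of Y, which is why the question's output hovers around 7.9.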