Ordinary least squares regression giving wrong prediction - regression

I am using statsmodels OLS to fit a series of points to a line:
import statsmodels.api as sm
Y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15]
X = [[73.759999999999991], [73.844999999999999], [73.560000000000002],
[73.209999999999994], [72.944999999999993], [73.430000000000007],
[72.950000000000003], [73.219999999999999], [72.609999999999999],
[74.840000000000003], [73.079999999999998], [74.125], [74.75],
[74.760000000000005]]
ols = sm.OLS(Y, X)
r = ols.fit()
preds = r.predict()
print preds
And I get the following results:
[ 7.88819844 7.89728869 7.86680961 7.82937917 7.80103898 7.85290687
7.8015737 7.83044861 7.76521269 8.00369809 7.81547643 7.92723304
7.99407312 7.99514256]
These are an about 10 times off. What am I doing wrong? I tried adding a constant, that just makes the values 1000 times bigger. I don't know much about statistics, so maybe there is something I need to do with the data?

I think you have switched your response and your predictor, like Michael Mayer suggested in his comment. If you plot the data with predictions from your model, you get something like this:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([1,2,3,4,5,6,7,8,9,11,12,13,14,15])
X = np.array([ 73.76 , 73.845, 73.56 , 73.21 , 72.945, 73.43 , 72.95 ,
73.22 , 72.61 , 74.84 , 73.08 , 74.125, 74.75 , 74.76 ])
Design = np.column_stack((np.ones(14), X))
ols = sm.OLS(Y, Design).fit()
preds = ols.predict()
plt.plot(X, Y, 'ko')
plt.plot(X, preds, 'k-')
plt.show()
If you switch X and Y, which is what I think you want, you get:
Design2 = np.column_stack((np.ones(14), Y))
ols2 = sm.OLS(X, Design2).fit()
preds2 = ols2.predict()
print preds2
[ 73.1386399 73.21305699 73.28747409 73.36189119 73.43630829
73.51072539 73.58514249 73.65955959 73.73397668 73.88281088
73.95722798 74.03164508 74.10606218 74.18047927]
plt.plot(Y, X, 'ko')
plt.plot(Y, preds2, 'k-')
plt.show()

Related

Sales Prediction error on catboost regression/CatBoostRegressor

d = {'customer':['A','B','C','A'],'season':[1,2,3,4],
'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
'cat2': ['high','low','high','medium'],'sale': [10,20,15,50]}
df = pd.DataFrame(data=d)
df
Desired output on season 5
d = {'customer':['A','B','C','A'],'season': [5,5,5,5],
'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
'cat2': ['high','low','high','medium'],'sale': [?,?,?,?]}
df = pd.DataFrame(data=d)
df
I tried
df=df.groupby(['customer','season','cat1','cat2'])['Sales'].sum().sort_values(ascending=False).reset_index()
from sklearn.model_selection import train_test_split
X=df[['customer','season','cat1','cat2']]
y=df[['Sales']]
X.season=X.season.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size = 0.90, random_state =42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size = 0.85, random_state =42)
categorical_features_indices = np.where(X.dtypes != np.float)[0]
import catboost
from catboost import MetricVisualizer, Pool, CatBoostRegressor, cv
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features_indices)
val_pool = Pool(data=X_val, label=y_val, cat_features=categorical_features_indices)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features_indices)
params = {
'iterations':900,
'loss_function': 'RMSE',
'learning_rate': 0.0109, #1 0.102,
'depth': 6,
'l2_leaf_reg': 6,
'border_count': 7,
'thread_count': 7,
'bagging_temperature': 2,
'random_strength': 2.23,
'colsample_bylevel': 0.85,
'custom_metric': ['MAPE', 'R2'],
'eval_metric': 'R2',
'random_seed': 41,
'max_ctr_complexity': 2,
'logging_level': 'Silent',
'use_best_model':False # Takes
}
reg_model = CatBoostRegressor(**params)
reg_model.fit(train_pool, eval_set=val_pool, plot=True, verbose=100)
X['season']=5
X['Predict_sales']=reg_model.predict(X)
The above code throws no error.
My Question is: My predict values doesn't change if input 5,6,7,8 however season is a continuous value. What am I doing wrong and how can i predict for season 6, 7, 8 and so on.
catboost is a tree-based model. Regression trees (as well as decision trees) partition the feature space and each partition yields the same value. Since neither season 5,6,7 or 8 occurred in the training data it should all land in the same partition and hence yielding exactly the same value.
You might need to go for another model type (e.g. linear regression). What kind of relationship would you expect between season and sales? Predicting on something you haven't seen in your training data always is hard (except if there is something like a linear relationship)

Linear Regression using sklearn

I have a model fitted with data but having trouble using the predict function.
d = {'df_Size': [1, 3, 5, 8, 10, 15, 18], 'RAM': [3676, 6532, 9432, 13697, 16633, 23620, 27990]}
df = pd.DataFrame(data=d)
df
X = np.array(df['df_Size']).reshape(-1, 1)
y = np.array(df['RAM']).reshape(-1, 1)
model = LinearRegression()
model.fit(X, y)
print(regr.score(X, y))
then when I try to predict on
X_Size = 25
X_Size
prediction = model.predict(X_Size)
I get the following error
ValueError: Expected 2D array, got scalar array instead:
array=25.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I think I am passing the 25 in the wrong format but would just like some help on getting the response for Ram considering the 25 rows.
Thanks,
You need to pass the predictor in the same shape (basically 1 column):
X.shape
Out[11]: (7, 1)
You can do:
model.predict(np.array(25).reshape(1,1))

define a function in which min() is used

I am trying to define a function in which I want a part of the function limited. I try to do this by using min() but it returns
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
My code:
def f(x, beta):
K_w = (1+((0.5*D)/(0.5*D+x))**2)**2
K_c = min(11,(3.5*(x/D)**(-0.5))) # <-- this is what gives me the problem. It should limit K_c to 11, but that does not work.
K_tot = (K_c**2+K_w**2+2*K_c*K_w*np.cos(beta))**0.5
return K_tot
x = np.linspace(0, 50, 100)
beta = np.linspace(0, 3.14, 180)
X, Y = np.meshgrid(x, beta)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 100, cmap = 'viridis')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
I expected K_c to be limited to 11, but it gave a
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I might be making a rookie mistake, but help is much appreciated!
Consider using np.clip of which its references can be found here.
np.clip(3.5*(x/D)**(-0.5), None, 11)
for your case.
For example,
>>> import numpy as np
>>> np.clip([1, 2, 3, 15], None, 11)
array([ 1, 2, 3, 11])
The problem with your code is that min is comparing a number with a list of which this is not expected.
Alternatively, here is a list comprehension approach:
A = [1, 2, 3, 15]
B = [min(11, a) for a in A]
print(B)

Pytorch Not Updating Variables in .step()

I'm attempting to convert old code to PyTorch code as an experiment. Ultimately, I will be doing regression on a 10,000+ x 100 Matrix, updating weights and whatnot appropriately.
Trying to learn, I'm slowly scaling up on toy examples. I'm hitting a wall with the following sample code.
import torch
import torch.nn as nn
import torch.nn.functional as funct
from torch.autograd import Variable
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_data = Variable( torch.Tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ] ),
requires_grad=True )
y_data = Variable( torch.Tensor( [ [2.0], [4.0], [6.0] ] ) )
w = Variable( torch.randn( 2, 1, requires_grad=True ) )
b = Variable( torch.randn( 1, 1, requires_grad=True ) )
class Model(torch.nn.Module) :
def __init__(self) :
super( Model, self).__init__()
self.linear = torch.nn.Linear(2,1) ## 2 features per entry. 1 output
def forward(self, x2, w2, b2) :
y_pred = x2 # w2 + b2
return y_pred
model = Model()
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD( model.parameters(), lr=0.01 )
for epoch in range(10) :
y_pred = model( x_data,w,b ) # Get prediction
loss = criterion( y_pred, y_data ) # Calc loss
print( epoch, loss.data.item() ) # Print loss
optimizer.zero_grad() # Zero gradient
loss.backward() # Calculate gradients
optimizer.step() # Update w, b
However, doing so, my loss is always the same, and investigating shows my w and b never actually change. I'm a bit lost at what's going on here.
Ultimately, I'd like to be able to store the results of the "new" w and b to compare across iterations and datasets.
It looks like a case of cargo programming to me.
Notice that your Model class doesn't make use of self in forward, so it is effectively a "regular" (non-method) function, and model is entirely stateless. The simplest fix to your code is to make optimizer aware of w and b, by creating it as optimizer = torch.optim.SGD([w, b], lr=0.01). I also rewrite model to be a function
import torch
import torch.nn as nn
# torch.autograd.Variable is roughly equivalent to requires_grad=True
# and is deprecated in PyTorch 1.0
# your code gives not reason to have `requires_grad=True` on `x_data`
x_data = torch.tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ])
y_data = torch.tensor( [ [2.0], [4.0], [6.0] ] )
w = torch.randn( 2, 1, requires_grad=True )
b = torch.randn( 1, 1, requires_grad=True )
def model(x2, w2, b2):
return x2 # w2 + b2
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD([w, b], lr=0.01 )
for epoch in range(10) :
y_pred = model( x_data,w,b )
loss = criterion( y_pred, y_data )
print( epoch, loss.data.item() )
optimizer.zero_grad()
loss.backward()
optimizer.step()
That being said, nn.Linear is built to simplify this procedure. It automatically creates an equivalent of both w and b, called self.weight and self.bias, respectively. Also, self.__call__(x) is equivalent to the definition of forward of your Model, in that it returns self.weight # x + self.bias. In other words, you can also use alternative code
import torch
import torch.nn as nn
x_data = torch.tensor( [ [1.0, 2.0], [2.0, 3.0], [3.0, 4.0] ] )
y_data = torch.tensor( [ [2.0], [4.0], [6.0] ] )
model = nn.Linear(2, 1)
criterion = torch.nn.MSELoss( size_average=False )
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 )
for epoch in range(10) :
y_pred = model(x_data)
loss = criterion( y_pred, y_data )
print( epoch, loss.data.item() )
optimizer.zero_grad()
loss.backward()
optimizer.step()
where model.parameters() can be used to enumerate model parameters (equivalent to the manually created list [w, b] above). To access your parameters (load, save, print, whatever) use model.weight and model.bias.

How can I use three Conv1d on the three axis of my 3*n matrix in Pytorch?

The following is my CNN. The input of it is a (3,64) matrix, I want to use three convolution kernels to process the x,y,z axis respectively.
class Char_CNN(nn.Module):
def __init__(self):
super(Char_CNN, self).__init__()
self.convdx = nn.Conv1d(1, 12, 20)
self.convdy = nn.Conv1d(1, 12, 20)
self.convdz = nn.Conv1d(1, 12, 20)
self.fc1 = nn.Linear(540, 1024)
self.fc2 = nn.Linear(1024, 30)
self.fc3 = nn.Linear(30, 13)
def forward(self, x):
after_convd = [self.convdx(x[:, :, 0]), self.convdy(x[:, :, 1]), self.convdz(x[:, :, 2])]
after_pool = [F.max_pool1d(F.relu(value), 3) for value in after_convd]
x = torch.cat(after_pool, 1)
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
x = self.fc3(x)
x = F.softmax(x)
return x
But during the running of loss = criterion(out, target), a RunTime Error occurs:
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.
I'm very new to pytorch so that I cannot find out the mistake of my code.
Can you help me?
The way of convolution is okay. The problem is my labels were between 1 and 13, and the correct range is 0 to 12.
After modifying it, my CNN works successfully.
But as a fresher to Pytorch and deep learning, I guess my convolution mode can be clearer and easier. Welcome to point out my errors!