Why is my stepwise regression causing a shorter output? - regression

I'm trying to run a stepwise regression but the output that I'm getting is shorter than the input data frame. I can't share my data unfortunately but any help would be much appreciated. Thank you in advance!
#training data
a3<-na.omit(train_occur)
sum(is.na(train_occur))
> 0
dim(a3)
>2228 10
full_log<-glm(formula = occurrence ~ . , family=binomial(link=logit), data= train_occur, control = list(maxit = 50))
back_log_occur<-step(full_log)
length(back_log_occur$fitted.values)
>66
#test data
dim(test_occur) #I took out the response variable although I found it doesn't seem to matter whether or not the response variable is there...
>243 9
pred_back_log_occur<-predict(object=back_log_occur,data=test_occur,type="response")
length(pred_back_log_occur)
> 66
I expected 2228 fitted values for the training and 243 predicted values for the test set.

Related

anova_test not returning Mauchly's for three way within subject ANOVA

I am using a data set called sleep (found here: https://drive.google.com/file/d/15ZnsWtzbPpUBQN9qr-KZCnyX-0CYJHL5/view) to run a three way within subject ANOVA comparing Performance based on Stimulation, Deprivation, and Time. I have successfully done this before using anova_test from rstatix. I want to look at the sphericity output but it doesn't appear in the output. I have got it to come up with other three way within subject datasets, so I'm not sure why this is happening. Here is my code:
anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
I also tried to save it to an object and use get_anova_table, but that didn't look any different.
sleep_aov <- anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
get_anova_table(sleep_aov, correction = "GG")
This is an ideal dataset I pulled from the internet, so I'm starting to think the data had a W of 1 (perfect sphericity) and so rstatix is skipping this output. Is this something anova_test does?
Here also is my code using a dataset that does return Mauchly's:
weight_loss_long <- pivot_longer(data = weightloss, cols = c(t1, t2, t3), names_to = "time", values_to = "loss")
weight_loss_long$time <- factor(weight_loss_long$time)
anova_test(data = weight_loss_long, dv = loss, wid = id, within = c(diet, exercises, time))
Not an expert at all, but it might be because your factors have only two levels.
From anova_summary() help:
"Value
return an object of class anova_test a data frame containing the ANOVA table for independent measures ANOVA. However, for repeated/mixed measures ANOVA, it is a list containing the following components are returned:
ANOVA: a data frame containing ANOVA results
Mauchly's Test for Sphericity: If any within-Ss variables with more than 2 levels are present, a data frame containing the results of Mauchly's test for Sphericity. Only reported for effects that have more than 2 levels because sphericity necessarily holds for effects with only 2 levels.
Sphericity Corrections: If any within-Ss variables are present, a data frame containing the Greenhouse-Geisser and Huynh-Feldt epsilon values, and corresponding corrected p-values. "

Columns of Data Frame are Being Swapped: Why is my loop switching the column values when I identify and assign the columns by name?

I need help with the specific code I will paste below. I am using the Ames Housing data set collected by Dean De Cock.
I am using a Python notebook and editing thru Anaconda's Jupyter Lab 2.1.5.
The code below is supposed to replace all np.nan or "None" values. For some reason,
after repeatedly calling a hand-made function inside a for loop, the columns of the resulting data frame get swapped around.
Note: I am aware I could do this with an "imputer." I plan to select numeric and object type features, impute them separately then put them back together. As a side-note, is there any way I can do that while having the details I output manually using text displayed or otherwise verified?
In the cell in question, the flow is:
1. Get and assign the number of data points in the data frame df_train.
2. Get and assign a series that lists the count of null values in df_train. The syntax is sr_null_counts = df_train.isnull().sum().
3. Create an empty list to which names of features that have 5% of their values equal to null are appended. They will be dropped later, outside the for loop. I thought at first that this was the problem since the command to drop the columns of df_train in-place used to be within the for-loop.
4. Repeatedly call a hand-made function to impute columns with null values not exceeding 5% of the row count for df_train.
I used a function that has a for-loop and nested try-except statements to:
1. Accept a series and, optionally, the series' name when it was a column in a dataframe. It assigns a copy of the passed series to a local variable.
2. In the exact order, (a) try to replace all null (NaN or None) values with the mean of the passed series. (b) If that fails, try to replace all null values with the median of the series. (c) If even that fails, replace all null values with the mode of the series.
3. Return the edited copy of the series with all null values replaced. It should also print out strings that tell me what feature was modified and what summary statistic was used to replace/impute the missing values.
The final line is to drop all the columns marked as having more than 5% missing values.
Here is the full code:
Splitting the main dataframe into a train and test set.
The full data-set was loaded thru df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv").
def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
    df = df.copy()
    df_train = df.sample(frac = train_fraction, random_state = random_state_val)
    bmask_istrain = df.index.isin(df_train.index.values)
    df_test = df.loc[ ~bmask_istrain ]
    return {
        "train":df_train,
        "test":df_test
    }
dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]
Get a List of Columns With Null Values
lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
    nullcount = df_housing[feature].isnull().sum()
    if nullcount > 0:
        lst_have_nulls.append(feature)
        print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False),'\n*****')
Definition of the hand-made function:
def impute_series(sr_values, feature_name = ''):
    sr_out = sr_values.copy()
    try:
        sr_out.fillna(value = sr_values.mean())
        print("Feature", feature_name, "imputed with mean:", sr_values.mean())
    except Exception as e:
        print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
        try:
            sr_out.fillna(value = sr_values.median())
            print("Feature", feature_name, "imputed with median:", sr_values.median())
        except Exception as e:
            print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
            sr_out.fillna(value = sr_values.mode())
            print("Feature", feature_name, "imputed with mode:", sr_values.mode())
    return sr_out
For-Loop
Getting the count of null values, defining the empty list of columns to drop to allow appending, and repeatedly
doing the following: For every column in lst_have_nulls, check if the column has equal, less or more than 5% missing values.
If more, append the column to lst_drop. Else, call the hand-made imputing function. After the for-loop, drop all columns in
lst_drop, in-place.
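The loop itself is not shown in the post. A minimal sketch of the flow described above might look like the following. It runs on a made-up frame and imputes with the mean only, leaving out the median/mode fallback of impute_series. The names df_train, sr_null_counts, lst_have_nulls and lst_drop come from the post; the toy data and everything else are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (assumption); the real frame comes from AmesHousing.tsv.
df_train = pd.DataFrame({
    "Lot Frontage": [50.0, np.nan, 70.0] + [60.0] * 37,  # 1 of 40 missing (2.5%)
    "Alley": [None] * 30 + ["Grvl"] * 10,                # 30 of 40 missing (75%)
    "SalePrice": [87000] * 40,
})

row_count = len(df_train)
sr_null_counts = df_train.isnull().sum()
lst_have_nulls = sr_null_counts[sr_null_counts > 0].index.tolist()
lst_drop = []

for feature in lst_have_nulls:
    if sr_null_counts[feature] / row_count > 0.05:
        # Too many missing values: mark the column for dropping later.
        lst_drop.append(feature)
    else:
        # fillna returns a new Series, so the result is assigned back by column name.
        df_train[feature] = df_train[feature].fillna(df_train[feature].mean())

# Drop the marked columns once, after the loop.
df_train.drop(columns=lst_drop, inplace=True)
print(df_train.columns.tolist())
```

Because every assignment goes through the column name, a loop of this shape cannot move values between columns, which suggests looking at how the data was loaded rather than at the loop.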
Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle. Here is a link.
https://www.kaggle.com/joachimrives/ames-housing-public-problem
Update: Problem Still Exists After Testing Anvar's Answer with Changes
When I tried the code of Anvar Kurmukov, my dataframe column values still got swapped. The change I made was adding int and float to the list of dtypes to check. The changes are inside the for-loop:
if dtype in [np.int64, np.float64, int, float].
It may be a problem with another part of my code in the full notebook. I will need to check where it is by calling df_train.info() cell by cell from the top. I tested the code in the notebook I made public. It is in cell 128. For some reason, after running Anvar's code, the df_train.info() method returned this:
1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath Bsmt Half Bath ... Roof Style SalePrice Screen Porch Street TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF Year Built Year Remod/Add
1222 1223 534453140 70 RL 50.0 4882 Pave NaN IR1 Bnk ... 0 0 0 0 0 NaN NaN NaN 0 87000
1642 1643 527256040 20 RL 81.0 13870 Pave NaN IR1 HLS ... 52 0 0 174 0 NaN NaN NaN 0 455000
1408 1409 905427050 50 RL 66.0 21780 Pave NaN Reg Lvl ... 36 0 0 144 0 NaN NaN NaN 0 185000
1729 1730 528218050 60 RL 65.0 10237 Pave NaN Reg Lvl ... 72 0 0 0 0 NaN NaN NaN 0 178900
1069 1070 528180110 120 RL 58.0 10110 Pave NaN IR1 Lvl ... 48 0 0 0 0 NaN NaN NaN 0 336860
tl;dr instead of try: except you should simply use if and check dtype of the column; you do not need to iterate over columns.
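import numpy as np

# drop columns with more than 5% missing values (drop returns a new frame, so assign it)
drop_columns = df.columns[df.isna().sum() / df.shape[0] > 0.05]
df = df.drop(drop_columns, axis=1)
num_columns = []
cat_columns = []
# items() replaces iteritems(), which was removed in pandas 2.0
for col, dtype in df.dtypes.items():
    if dtype in [np.int64, np.float64]:
        num_columns.append(col)
    else:
        cat_columns.append(col)
df[num_columns] = df[num_columns].fillna(df[num_columns].mean())
# mode() returns a DataFrame; its first row holds one fill value per column
df[cat_columns] = df[cat_columns].fillna(df[cat_columns].mode().iloc[0])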
Short comment on make_traintest function: I would simply return 2 separate DataFrames instead of a dictionary or use sklearn.model_selection.train_test_split.
upd. You can check for number of NaN values in a column, but it is unnecessary if your only goal is to impute NaNs.
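One detail that is easy to miss when filling with the mode: DataFrame.mode() returns a DataFrame (a column can have several modes), so its first row has to be selected to get a single fill value per column. Below is a minimal sketch on a made-up two-column frame, using select_dtypes to split the columns instead of an explicit loop. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric and one categorical column, each with a missing value.
df = pd.DataFrame({
    "area": [100.0, np.nan, 300.0, 200.0],
    "street": ["Pave", "Pave", None, "Grvl"],
})

num_columns = df.select_dtypes(include="number").columns
cat_columns = df.select_dtypes(exclude="number").columns

filled = df.copy()
# Numeric columns: fill with the per-column mean (a Series indexed by column name).
filled[num_columns] = filled[num_columns].fillna(filled[num_columns].mean())
# Categorical columns: take the first row of mode() to get one value per column.
modes = filled[cat_columns].mode().iloc[0]
filled[cat_columns] = filled[cat_columns].fillna(modes)

print(filled)
```

Passing the Series from .iloc[0] lets fillna match fill values to columns by name, the same way the mean is applied to the numeric columns.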
Answer
I discovered the answer as to why my columns were being swapped. They were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem on the notebook in my PC, I simply added the following parameter and value to pd.read_csv: index_col = "Order". That fixed the problem on my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem.
The version of the Ames Housing data set I first used on the notebook - for some reason - was also the cause for the column swapping.
Anvar's Code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code. Mine is still to be tested.
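To illustrate what index_col changes, here is a small sketch that reads a made-up three-column, tab-separated snippet both ways. The snippet only stands in for AmesHousing.tsv and its values are assumptions; the call to pd.read_csv with sep and index_col is the one described above.

```python
import io
import pandas as pd

# Made-up stand-in for the first columns of AmesHousing.tsv.
tsv_text = "Order\tPID\tLot Area\n1\t526301100\t31770\n2\t526350040\t11622\n"

# Without index_col, "Order" stays an ordinary data column.
df_default = pd.read_csv(io.StringIO(tsv_text), sep="\t")
# With index_col, "Order" becomes the row index and leaves the columns.
df_indexed = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="Order")

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
print(df_indexed.index.name)
```

Comparing the two column lists is a quick way to check whether a frame has one more column than the header documentation expects.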
Testing Done
I modified the Kaggle notebook I linked in my question. I used the data set I was actually working in with my PC. When I did that, the code given by Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before trying. I only reviewed the data sets using head() and manually checked the column inputs. If you want to check the notebook, here it is:
https://www.kaggle.com/joachimrives/ames-housing-public-problem/
To test if the data set was at fault, I created two data frames. One was taken directly from my local file uploaded to Kaggle. The other used the current version of the Ames Iowa Housing data set I had used as input. The columns were properly "aligned" with their expected input. To find the expected column values, I used this source:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Here are the screenshots of the different results I got when I swapped data sets:
With an uploaded copy of my local file:
With the original AmesHousing.csv From Notebook Version 1:
The data set I Used that Caused the Column-swap on the Kaggle Notebook
https://www.kaggle.com/marcopale/housing

How to get dataset into array

I have worked all the tutorials and searched for "load csv tensorflow" but just can't get the logic of it all. I'm not a total beginner, but I don't have much time to complete this, and I've been suddenly thrown into Tensorflow, which is unexpectedly difficult.
Let me lay it out:
Very simple CSV file of 184 columns that are all float numbers. A row is simply today's price, three buy signals, and the previous 180 days prices
close = tf.placeholder(float, name='close')
signals = tf.placeholder(bool, shape=[3], name='signals')
previous = tf.placeholder(float, shape=[180], name = 'previous')
This article: https://www.tensorflow.org/guide/datasets
It covers how to load pretty well. It even has a section on changing to numpy arrays, which is what I need to train and test the 'net. However, as the author says in the article leading to this Web page, it is pretty complex. It seems like everything is geared toward doing data manipulation, where we have already normalized our data (nothing has really changed in AI since 1983 in terms of inputs, outputs, and layers).
Here is a way to load it, but not in to Numpy and no example of not manipulating the data.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    with open('/BTC1.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            ?????????
            line_count += 1
I need to know how to get the csv file in to the
close = tf.placeholder(float, name='close')
signals = tf.placeholder(bool, shape=[3], name='signals')
previous = tf.placeholder(float, shape=[180], name = 'previous')
so that I can follow the tutorials to train and test the net.
Your question is not entirely clear to me. You might be asking, tell me if I'm wrong, how to feed data into your model? There are several ways to do so.
Use placeholders with feed_dict during the session. This is the basic and easiest one, but it often suffers from training performance issues. For further explanation, check this post.
Use queues. Hard to implement and badly documented; I don't suggest it, because it has been superseded by the third method.
tf.data API.
...
So to answer your question by the first method:
# get your array outside the session
import csv
import numpy as np
import tensorflow as tf

with open('/BTC1.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    # csv.reader yields strings, so convert to float explicitly
    dataset = np.asarray([data for data in csv_reader], dtype=float)
close_col = dataset[:, 0]
signal_cols = dataset[:, 1:4]   # the three buy signals
previous_cols = dataset[:, 4:]  # the previous 180 days' prices
# let's say you load 100 rows each time for training
batch_size = 100
# define placeholders like you
...
with tf.Session() as sess:
    ...
    for i in range(number_iter):
        start = i * batch_size
        end = (i + 1) * batch_size
        sess.run(train_operation, feed_dict={close: close_col[start:end],
                                             signals: signal_cols[start:end],
                                             previous: previous_cols[start:end]})
By the third method:
# retrieve your columns like before
...
# let's say you load 100 rows each time for training
batch_size = 100
# construct your input pipeline
c_col, s_col, p_col = wrapper(filename)
batch = tf.data.Dataset.from_tensor_slices((c_col, s_col, p_col))
batch = batch.shuffle(c_col.shape[0]).batch(batch_size)  # mix data --> assemble batches
iterator = batch.make_initializable_iterator()
iter_init_operation = iterator.initializer
c_it, s_it, p_it = iterator.get_next()  # get next batch operation automatically called at each iteration within the session
# replace your close, signal, previous placeholder in your model by c_it, s_it, p_it when you define your model
...
with tf.Session() as sess:
    # you need to initialize the iterators
    sess.run([tf.global_variables_initializer(), iter_init_operation])
    ...
    for i in range(number_iter):
        # no manual slicing needed: each run pulls the next batch from the iterator
        sess.run(train_operation)
Good luck!
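The column split that both methods rely on can be checked without TensorFlow. The sketch below assumes the layout described in the question (1 price, 3 signals, 180 previous prices, 184 columns in total) and uses a small generated CSV in place of '/BTC1.csv'. The generated values are placeholders, not real data.

```python
import io
import numpy as np

# Stand-in for '/BTC1.csv' (assumption): 2 rows x 184 columns of generated floats.
rows = np.arange(2 * 184, dtype=float).reshape(2, 184)
csv_text = "\n".join(",".join(str(v) for v in row) for row in rows)

# np.loadtxt parses the text straight into a float array.
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",", ndmin=2)

close_col = dataset[:, 0]            # today's price, shape (rows,)
signal_cols = dataset[:, 1:4] != 0   # three buy signals as booleans, shape (rows, 3)
previous_cols = dataset[:, 4:]       # previous 180 prices, shape (rows, 180)

print(close_col.shape, signal_cols.shape, previous_cols.shape)
```

With a real file, the io.StringIO object would be replaced by the file path; the three arrays can then be sliced into batches for feed_dict or handed to tf.data.Dataset.from_tensor_slices.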

Reading variable length cell arrays into matrix

I am using Octave 4.2 and using xlsread in a for loop to import data from several different RTDs. I am importing using the following code:
for i=rtdmin:rtdmax
  filnum=num2str(i);
  fid = strcat(pre, filnum, filtyp);
  j = exist(fid);
  if j == 2
    [num{i}, txt{i}, raw{i}, lim{i}] = xlsread(fid);
    time{i} = num{i}(:,2);
    temp{i} = num{i}(:,3);
  endif
endfor
The problem is none of the RTDs have the exact same number of readings (30,000 +-200), or stop and start at the exact same time, although the readings overlap. Because of the variable size of the data in each cell I cannot simply pull it out into a matrix in order to process the data. Can anyone suggest a solution of how to get the data into a matrix, or can suggest a change to the existing code so the data is read into a matrix to begin with. Thank you in advance.

How do I write a function that takes the average of a list of numbers

I want to avoid importing different modules as that is mostly what I have found while looking online. I am stuck with this bit of code and I don't really know how to fix it or improve on it. Here's what I've got so far.
def avg(lst):
    '''lst is a list that contains lists of numbers; the
    function prints, one per line, the average of each list'''
    for i[0:-1] in lst:
        return (sum(i[0:-1]))//len(i)
Again, I'm quite new and this for loops jargon is quite confusing to me, so if someone could help me get it so the output of, say, a list of grades would be different lines containing the averages. So if for lst I inserted grades = [[95,92,86,87], [66,54], [89,72,100], [33,0,0]], it would have 4 lines that all had the averages of those sublists. I also am to assume in the function that the sublists could have any amount of grades, but I can assume that the lists have non-zero values.
Edit1: @jramirez, could you explain what that is doing differently from mine, if possible? I don't doubt that it is better or that it will work, but I still don't really understand how to recreate this myself... regardless, thank you.
I think this is what you want:
def grade_average(grades):
    for grade in grades:
        avg = 0
        for num in grade:
            avg += num
        avg = avg / len(grade)
        print("Average for " + str(grade) + " is = " + str(avg))
if __name__ == '__main__':
    grades = [[95,92,86,87],[66,54],[89,72,100],[33,0,0]]
    grade_average(grades)
Result:
Average for [95, 92, 86, 87] is = 90.0
Average for [66, 54] is = 60.0
Average for [89, 72, 100] is = 87.0
Average for [33, 0, 0] is = 11.0
Problems with your code: the extraneous indexing of i; the use of // to truncate the average (use round if you want to round it); and the use of return in the loop, so it would stop after the first average. Your docstring says 'print' but you return instead. This is actually a good thing. Functions should not print the result they calculate, as that makes the answer inaccessible to further calculation. Here is how I would write this, as a generator function.
def averages(gradelists):
    '''Yield average for each gradelist.'''
    for glist in gradelists:
        yield sum(glist) / len(glist)
print(list(averages(
    [[95,92,86,87], [66,54], [89,72,100], [33,0,0]])))
[90.0, 60.0, 87.0, 11.0]
To return a list, change the body of the function to (beginner version)
    ret = []
    for glist in gradelists:
        ret.append(sum(glist) / len(glist))
    return ret
or (more advanced, using list comprehension)
    return [sum(glist) / len(glist) for glist in gradelists]
However, I really recommend learning about iterators, generators, and generator functions (defined with yield).
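As a small illustration of what a generator function gives you, the sketch below reuses the averages function from above and consumes it in two ways: one value at a time with next(), and in a for loop that prints one average per line, as the question asked. The variable names first and rest are only for illustration.

```python
def averages(gradelists):
    """Yield the average of each sublist, one at a time."""
    for glist in gradelists:
        yield sum(glist) / len(glist)

grades = [[95, 92, 86, 87], [66, 54], [89, 72, 100], [33, 0, 0]]

# A generator computes values lazily: nothing runs until a value is requested.
gen = averages(grades)
first = next(gen)   # average of the first sublist only
rest = list(gen)    # the remaining averages, computed on demand

# Printing stays outside the function, so the caller decides what to do with the results.
for avg in averages(grades):
    print(avg)
```

Keeping the calculation in the generator and the printing in the caller is the separation the answer recommends: the same function serves both the "one per line" output and any later calculation.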