Columns of Data Frame are Being Swapped: Why is my loop switching the column values when I identify and assign the columns by name?

I need help with the specific code I will paste below. I am using the Ames Housing data set collected by Dean De Cock.
I am using a Python notebook and editing through Anaconda's JupyterLab 2.1.5.
The code below is supposed to replace all np.nan or "None" values. For some reason,
after repeatedly calling a hand-made function inside a for loop, the columns of the resulting data frame get swapped around.
Note: I am aware I could do this with an "imputer." I plan to select the numeric and object-type features, impute them separately, then put them back together. As a side note, is there any way I can do that while still displaying, or otherwise verifying, the details I currently print out manually? (A sketch of that idea follows.)
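A minimal sketch of that select-impute-recombine idea, assuming pandas' select_dtypes and scikit-learn's SimpleImputer; the statistics_ attribute exposes the value each column was imputed with, which covers the verification part:

import pandas as pd
from sklearn.impute import SimpleImputer

df_num = df_train.select_dtypes(include = "number")
df_obj = df_train.select_dtypes(include = "object")

imp_num = SimpleImputer(strategy = "mean")
imp_obj = SimpleImputer(strategy = "most_frequent")
df_num = pd.DataFrame(imp_num.fit_transform(df_num), columns = df_num.columns, index = df_num.index)
df_obj = pd.DataFrame(imp_obj.fit_transform(df_obj), columns = df_obj.columns, index = df_obj.index)

# Print the value each column was imputed with, for manual verification
print(dict(zip(df_num.columns, imp_num.statistics_)))
print(dict(zip(df_obj.columns, imp_obj.statistics_)))

df_train_imputed = pd.concat([df_num, df_obj], axis = 1)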
In the cell in question, the flow is:
Get and assign the number of data points in the data frame df_train.
Get and assign a series that lists the count of null values in df_train. The syntax is sr_null_counts = df_train.isnull().sum().
Create an empty list to which the names of features that have more than 5% of their values null are appended. These will be dropped later, outside the for loop. I thought at first that this was the problem, since the command to drop the columns of df_train in-place used to be inside the for loop.
Repeatedly call a hand-made function to impute columns with null values not exceeding 5% of the row count for df_train.
I used a function that has a for-loop and nested try-except statements to:
Accept a series and, optionally, the series' name when it was a column in a dataframe. It assigns a copy of the passed series
to a local variable.
In the exact order, (a) try to replace all null (NaN or None) values with the mean of the passed series.
(b) If that fails, try to replace all null values with the median of the series.
(c) If even that fails, replace all null values with the mode of the series.
Return the edited copy of the series with all null values replaced. It should also print out strings that tell me what feature
was modified and what summary statistic was used to replace/impute the missing values.
The final line is to drop all the columns marked as having more than 5% missing values.
Here is the full code:
Splitting the main dataframe into a train and test set.
The full data set was loaded through df_housing = pd.read_csv(sep = '\t', filepath_or_buffer = "AmesHousing.tsv").
def make_traintest(df, train_fraction = 0.7, random_state_val = 88):
    df = df.copy()
    df_train = df.sample(frac = train_fraction, random_state = random_state_val)
    bmask_istrain = df.index.isin(df_train.index.values)
    df_test = df.loc[ ~bmask_istrain ]
    return {
        "train": df_train,
        "test": df_test
    }
dict_traintest = make_traintest(df = df_housing)
df_train = dict_traintest["train"]
df_test = dict_traintest["test"]
Get a List of Columns With Null Values
lst_have_nulls = []
for feature in df_housing.columns.values.tolist():
    nullcount = df_housing[feature].isnull().sum()
    if nullcount > 0:
        lst_have_nulls.append(feature)
        print(feature, "\n=====\nNull Count:\t", nullcount, '\n', df_housing[feature].value_counts(dropna = False), '\n*****')
Definition of the hand-made function:
def impute_series(sr_values, feature_name = ''):
    # Work on a copy so the passed-in series is left untouched
    sr_out = sr_values.copy()
    try:
        # (a) mean only works for numeric series; note fillna returns a new
        # series, so the result must be assigned back
        sr_out = sr_out.fillna(value = sr_values.mean())
        print("Feature", feature_name, "imputed with mean:", sr_values.mean())
    except Exception as e:
        print("Filling NaN values with mean of feature", feature_name, "caused an error:\n", e)
        try:
            # (b) fall back to the median
            sr_out = sr_out.fillna(value = sr_values.median())
            print("Feature", feature_name, "imputed with median:", sr_values.median())
        except Exception as e:
            print("Filling NaN values with median for feature", feature_name, "caused an error:\n", e)
            # (c) last resort: the mode; mode() returns a series, so take its first entry
            sr_out = sr_out.fillna(value = sr_values.mode()[0])
            print("Feature", feature_name, "imputed with mode:", sr_values.mode()[0])
    return sr_out
For-Loop
Getting the count of null values, defining the empty list of columns to drop, and then doing the following: for every column in lst_have_nulls, check whether the column has more than 5% missing values. If it does, append the column to lst_drop; otherwise, call the hand-made imputing function. After the for-loop, drop all columns in lst_drop, in-place. (A sketch of the cell follows.)
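That cell isn't reproduced above, so here is a minimal sketch of it (lst_drop and row_count are assumed names; everything else matches the variables already introduced):

row_count = df_train.shape[0]
sr_null_counts = df_train.isnull().sum()

lst_drop = []
for feature in lst_have_nulls:
    if sr_null_counts[feature] > 0.05 * row_count:
        # More than 5% missing: mark the column to be dropped later
        lst_drop.append(feature)
    else:
        # Otherwise, impute the column with the hand-made function
        df_train[feature] = impute_series(df_train[feature], feature_name = feature)

df_train.drop(columns = lst_drop, inplace = True)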
Where did I go wrong? In case you need the entire notebook, I have uploaded it to Kaggle. Here is a link.
https://www.kaggle.com/joachimrives/ames-housing-public-problem
Update: Problem Still Exists After Testing Anvar's Answer with Changes
When I tried the code of Anvar Kurmukov, my dataframe column values still got swapped. The change I made was adding int and float to the list of dtypes to check. The changes are inside the for-loop:
if dtype in [np.int64, np.float64, int, float].
It may be a problem with another part of my code in the full notebook. I will need to check where it is by calling df_train.info() cell by cell from the top. I tested the code in the notebook I made public; it is in cell 128. For some reason, after running Anvar's code, inspecting df_train returned this:
1st Flr SF 2nd Flr SF 3Ssn Porch Alley Bedroom AbvGr Bldg Type Bsmt Cond Bsmt Exposure Bsmt Full Bath Bsmt Half Bath ... Roof Style SalePrice Screen Porch Street TotRms AbvGrd Total Bsmt SF Utilities Wood Deck SF Year Built Year Remod/Add
1222 1223 534453140 70 RL 50.0 4882 Pave NaN IR1 Bnk ... 0 0 0 0 0 NaN NaN NaN 0 87000
1642 1643 527256040 20 RL 81.0 13870 Pave NaN IR1 HLS ... 52 0 0 174 0 NaN NaN NaN 0 455000
1408 1409 905427050 50 RL 66.0 21780 Pave NaN Reg Lvl ... 36 0 0 144 0 NaN NaN NaN 0 185000
1729 1730 528218050 60 RL 65.0 10237 Pave NaN Reg Lvl ... 72 0 0 0 0 NaN NaN NaN 0 178900
1069 1070 528180110 120 RL 58.0 10110 Pave NaN IR1 Lvl ... 48 0 0 0 0 NaN NaN NaN 0 336860

tl;dr: instead of try/except you should simply use an if and check the dtype of the column; you do not need to iterate over columns one by one.
drop_columns = df.columns[df.isna().sum() / df.shape[0] > 0.05]
df = df.drop(drop_columns, axis=1)  # drop returns a new frame, so assign it back

num_columns = []
cat_columns = []
for col, dtype in df.dtypes.items():
    if dtype in [np.int64, np.float64]:
        num_columns.append(col)
    else:
        cat_columns.append(col)

df[num_columns] = df[num_columns].fillna(df[num_columns].mean())
# mode() returns a frame with one row per modal value; take the first row
df[cat_columns] = df[cat_columns].fillna(df[cat_columns].mode().iloc[0])
Short comment on make_traintest function: I would simply return 2 separate DataFrames instead of a dictionary or use sklearn.model_selection.train_test_split.
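For illustration, a minimal sketch of that suggestion (mirroring the 0.7 fraction and random state used in the question):

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_housing, train_size = 0.7, random_state = 88)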
upd. You can check for the number of NaN values in a column, but it is unnecessary if your only goal is to impute NaNs.

Answer
I discovered the answer as to why my columns were being swapped. They were not actually being swapped. The original problem was that I had not set the "Order" column as the index column. To fix the problem in the notebook on my PC, I simply added the following parameter and value to pd.read_csv: index_col = "Order". That fixed the problem in my local notebook. When I tried it on the Kaggle notebook, however, it did not fix the problem.
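For clarity, the corrected load call is the original one with index_col added:

df_housing = pd.read_csv(filepath_or_buffer = "AmesHousing.tsv", sep = '\t', index_col = "Order")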
The version of the Ames Housing data set I first used on the notebook - for some reason - was also the cause for the column swapping.
Anvar's code is fine. You may test the code I wrote, but to be safe, defer to Anvar's code; mine is still to be tested.
Testing Done
I modified the Kaggle notebook I linked in my question to use the data set I was actually working with on my PC. When I did that, the code given in Anvar Kurmukov's answer worked perfectly. I tested my own code and it seems fine, but test both versions before relying on either. I only reviewed the data sets using head() and manually checked the column inputs. If you want to check the notebook, here it is:
https://www.kaggle.com/joachimrives/ames-housing-public-problem/
To test if the data set was at fault, I created two data frames. One was taken directly from my local file uploaded to Kaggle. The other used the current version of the Ames Iowa Housing data set I had used as input. The columns were properly "aligned" with their expected input. To find the expected column values, I used this source:
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt
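A sketch of that comparison (the file paths here are placeholders, not the notebook's actual ones):

df_local = pd.read_csv("AmesHousing.tsv", sep = '\t')  # copy of my local file
df_current = pd.read_csv("AmesHousing.csv")            # current version of the data set

print(list(df_local.columns) == list(df_current.columns))  # same columns, same order?
df_local.head()  # eyeball the first rows against the documentation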
Here are the screenshots of the different results I got when I swapped data sets:
With an uploaded copy of my local file:
With the original AmesHousing.csv from Notebook Version 1:
The data set I used that caused the column swap on the Kaggle notebook:
https://www.kaggle.com/marcopale/housing

Related

Use of function / return

I had the task to code the following:
Take a list of integers and return the sum of those numbers, but only the odd ones.
Example input: [1,5,3,2]
Output: 9
I did the code below and it worked perfectly.
numbers = [1,5,3,2]
print(numbers)
add_up_the_odds = []
for number in numbers:
    if number % 2 == 1:
        add_up_the_odds.append(number)
print(add_up_the_odds)
print(sum(add_up_the_odds))
Then I tried to re-code it using function definition / return:
def add_up_the_odds(numbers):
    odds = []
    for number in range(1,len(numbers)):
        if number % 2 == 1:
            odds.append(number)
    return odds

numbers = [1,5,3,2]
print (sum(odds))
But I couldn't make it work. Can anybody help with that?
Note: I'm going to assume Python 3.x
It looks like you're defining your function, but never calling it.
When the interpreter finishes going through your function definition, the function is now there for you to use - but it never actually executes until you tell it to.
Between the last two lines in your code, you need to call add_up_the_odds() on your numbers list, and assign the result to the odds variable.
i.e. odds = add_up_the_odds(numbers)
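Put together, a corrected sketch might look like this. Note it also iterates over the list's values; the original range(1, len(numbers)) loop would have iterated over the index positions 1 to 3 rather than the numbers themselves:

def add_up_the_odds(numbers):
    odds = []
    for number in numbers:       # iterate over the values, not index positions
        if number % 2 == 1:      # keep only the odd values
            odds.append(number)
    return odds

numbers = [1, 5, 3, 2]
odds = add_up_the_odds(numbers)  # actually call the function
print(sum(odds))                 # prints 9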

Using 2 different outputs of 'return' of a function in separate elements of a plot

I am drawing a plot of voltage over time. I want the voltage values to be passed through a 'scaling' function which converts them from volts to kilovolts if the biggest element is higher than 1000 volts (e.g. 11000 volts to 11 kilovolts).
This function is supposed to return 2 separate outputs: one for the (new) voltage values and one for the unit. The values are fed into the y-axis of the plot and the unit is given to that axis's label. For example:
import numpy as np
import matplotlib.pyplot as plt

time = np.array([0, 1, 2, 3])
system_voltage1 = np.array([110, 120, 130, 150])
system_voltage2 = np.array([11000, 12000, 13000, 15000])

def scaling_function(input):
    if np.amax(input) < 1000:
        output = input/1
        Voltage_label = 'Voltage in Volts'
    if np.amax(input) > 1000:
        output = input/1000
        Voltage_label = 'Voltage in KILOVolts'
    return (output, Voltage_label)

fig14 = plt.figure(figsize=(16,9))
ax1 = fig14.add_subplot(111)
l1, = ax1.plot(time, scaling_function(system_voltage), color='r')
ax1.set_xlabel("time in second", color='k')
ax1.set_ylabel(Voltage_label, color='k')
Now, I am having trouble calling this function properly. I need the plot to receive only the values output of scaling_function(system_voltage), and ax1.set_ylabel(Voltage_label, color='k') to receive the unit. Specifically:
A) My problem: I don't know how to write the code so that only the first output is used where scaling_function(system_voltage) is called, and the second output is used in the labeling line.
B) Something I tried but didn't work: Voltage_label is not recognized outside scaling_function, as it is local to the function. I mean, I cannot access Voltage_label because its value is not assigned globally.
Can anyone help me with this?
The function returns a tuple, so unpack it into two names and use each where it is needed:

y, l = scaling_function(system_voltage)
l1, = ax1.plot(time, y, color='r')
ax1.set_xlabel("time in second", color='k')
ax1.set_ylabel(l, color='k')

Why is my stepwise regression causing a shorter output?

I'm trying to run a stepwise regression but the output that I'm getting is shorter than the input data frame. I can't share my data unfortunately but any help would be much appreciated. Thank you in advance!
# training data
a3 <- na.omit(train_occur)
sum(is.na(train_occur))
> 0
dim(a3)
> 2228 10

full_log <- glm(formula = occurrence ~ ., family = binomial(link = logit), data = train_occur, control = list(maxit = 50))
back_log_occur <- step(full_log)
length(back_log_occur$fitted.values)
> 66

# test data
dim(test_occur) # I took out the response variable, although it doesn't seem to matter whether or not it is there...
> 243 9
pred_back_log_occur <- predict(object = back_log_occur, data = test_occur, type = "response")
length(pred_back_log_occur)
> 66
I expected 2228 fitted values for the training and 243 predicted values for the test set.

SSRS Pie chart hide 0 Value

SQL Server 2012 - SSRS Questions
I currently have a pie chart that shows the number of deliveries as a percentage, broken down by whether they were late, on time, or early. What I am trying to do is use an expression in the chart series labels' "Visible" property to hide a label when its value is 0 on the chart. Of note: in the table this value is returned as 0.00. I have tried various SWITCH and IIF statements, but nothing seems to work, and it's likely I am getting the syntax wrong. Can anyone help?
Table Values
TotalIssued  Early  Late  OnTime  EarlyPerc  LatePerc  OnTimePerc
6            0      4     2       0.00       66.67     33.33
=SWITCH(
    (First(Fields!EarlyPerc.Value, "EstimatesIssued") = 0), false,
    (First(Fields!LatePerc.Value, "EstimatesIssued") = 0), false,
    (First(Fields!OnTimePerc.Value, "EstimatesIssued") = 0), false,
    true)
Thanks
Try:
=SWITCH(
    First(Fields!EarlyPerc.Value, "EstimatesIssued") = 0, false,
    First(Fields!LatePerc.Value, "EstimatesIssued") = 0, false,
    First(Fields!OnTimePerc.Value, "EstimatesIssued") = 0, false,
    true, true)
UPDATE:
If you have one field per percentage and your dataset always returns one row, you have to select each series in the Chart Data window and press F4 to open the Properties window.
In the Properties window, set the Visible property for EarlyPerc to:
=IIF(Fields!EarlyPerc.Value=0,False,True)
And so on for the next two series you have (LatePerc and OnTimePerc).
Let me know if this helps.

Adding the results of multiple functions

Using Python 3.3.
I stumbled upon another problem with my program. It's the same solar program; again I decided to add more functionality. It's basically ad hoc: I'm adding things as I go along. I realize it can be made more efficient, but once I decide it's done, I'll post the whole code up.
Anyway, I need to add the results from multiple functions. Here's a part of my code:
def janCalc():
    for a in angle(0,360,10): # angle of orientation
        for d in days(1,32,1.0006630137): # day number of year
            for smodule in equation(): # equation() function not shown in this coding
                total_jan += smodule # total_jan is already defined elsewhere
        avg_jan = total_jan/(60*(1.0006630137*31))
        ratio_jan = avg_jan/5.67
        calcJan = (ratio_jan*4.79)
        yield calcJan
        total_jan = 0 # necessary to reset total to 0 for next angle interval

def febCalc():
    for a in angle(0,360,10):
        for d in days((1.0006630137*31),61,1.0006630137):
            for smodule in equation():
                total_feb += smodule
        avg_feb = total_feb/(60*(1.0006630137*28))
        ratio_feb = avg_feb/6.56
        calcFeb = (ratio_feb*4.96)
        yield calcFeb
        total_feb = 0
#etc..............
Is there any way to add the yields of each function?
For example: calcJan + calcFeb + .....
I would like to get the total results under each angle interval and then dividing by 12 to get the average value per interval. Like so:-
0 degrees---->total/12
10 deg ---->total/12
20 deg ---->total/12
30 deg ---->total/12
........
360 deg ---->total/12
If you need more info, let me know.
ADDENDUM
The solution was essentially provided by @jonrsharpe, but I encountered a bit of a problem.
Traceback (most recent call last):
  File "C:\Users\User\Documents\Python\Solar program final.py", line 247, in <module>
    output=[sum(vals)/12 for vals in zip(*(gen() for gen in months))]
  File "C:\Users\User\Documents\Python\Solar program final.py", line 247, in <listcomp>
    output=[sum(vals)/12 for vals in zip(*(gen() for gen in months))]
  File "C:\Users\User\Documents\Python\Solar program final.py", line 103, in janCalc
    for smodule in equation():
  File "C:\Users\User\Documents\Python\Solar program final.py", line 63, in equation
    d=math.asin(math.sin(math.radians(23.45))*math.sin(math.radians((360/365.242)*(d-81))))
NameError: global name 'd' is not defined
I've isolated it to:

for d in days((1.0006630137*31),61,1.0006630137):
    for smodule in equation():

It turns out I can't reference a function from inside a function? I'm not too sure. So even my original code did not work; I assumed it was working because previously I had not defined each month as a function. I should have tested it first.
Do you know how to get around this?
A simple example to demonstrate how to combine multiple generators:
>>> def gen1():
...     for x in range(5):
...         yield x
...
>>> def gen2():
...     for x in range(5, 10):
...         yield x
...
>>> [sum(vals) for vals in zip(*(gen() for gen in (gen1, gen2)))]
[5, 7, 9, 11, 13]
Or, written out long hand:
output = list(gen1())
for index, value in enumerate(gen2()):
    output[index] += value
You can modify either version to include a division, too, so your case would look something like:
months = [janCalc, febCalc, ...]
output = [sum(vals) / 12 for vals in zip(*(gen() for gen in months))]
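As an aside on the NameError in the addendum: equation() appears to read d as a global name, but d only exists as a loop variable inside janCalc. One way around that, sketched here under the assumption that equation()'s real signature can be changed (it isn't shown in the question), is to pass d in explicitly:

import math

def equation(d):
    # Hypothetical signature: accept the day number as an argument instead of
    # relying on a global name 'd'
    decl = math.asin(math.sin(math.radians(23.45))
                     * math.sin(math.radians((360/365.242)*(d-81))))
    yield decl  # placeholder; the rest of the real calculation is not shown

# and inside each month's generator:
# for smodule in equation(d):
#     total_jan += smodule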