How to create a scikit learn dataset? - csv

I have an array where the first column holds the class (as an integer) and the remaining columns are features.
Something like this:
1,0,34,23,2
0,0,21,11,0
3,11,2,11,1
How can I turn this into a scikit-learn compatible dataset, so I can call something like
mydataset = datasets.load_mydataset()?

You can simply use pandas. For example, suppose you have copied your dataset to a file named dataset.csv and added a header row that labels the columns appropriately.
In [1]: import pandas as pd
In [2]: df = pd.read_csv('dataset.csv')
In [3]: df
Out[3]:
Label f1 f2 f3 f4
0 1 0 34 23 2
1 0 0 21 11 0
2 3 11 2 11 1
In [4]: y_train= df['Label']
In [5]: x_train = df.drop('Label', axis=1)
In [6]: x_train
Out[6]:
f1 f2 f3 f4
0 0 34 23 2
1 0 21 11 0
2 11 2 11 1
In [7]: y_train
Out[7]:
0 1
1 0
2 3
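If an object with `data` and `target` attributes is wanted, similar in spirit to what the scikit-learn loaders return, a small wrapper may be enough. The following is only a sketch: `load_mydataset` is a hypothetical name taken from the question, `SimpleNamespace` is used here as a stand-in for a real dataset container, and the file is assumed to have no header row and to contain only integers.

```python
import io
from types import SimpleNamespace

import numpy as np

# The three rows from the question, without a header line.
RAW = """1,0,34,23,2
0,0,21,11,0
3,11,2,11,1
"""


def load_mydataset(source):
    """Hypothetical loader: column 0 is the class label, the rest are features."""
    table = np.loadtxt(source, delimiter=",", dtype=int, ndmin=2)
    return SimpleNamespace(data=table[:, 1:], target=table[:, 0])


# A file path can be passed instead of the in-memory buffer used here.
mydataset = load_mydataset(io.StringIO(RAW))
print(mydataset.data.shape)  # (3, 4)
print(mydataset.target)      # [1 0 3]
```

The `data` and `target` arrays can then be passed to an estimator's `fit` method in the usual way.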


statsmodels OLS gives parameters despite perfect multicollinearity

Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly collinear. Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 together is the well-known dummy variable trap, which, as I understand it, should make the model impossible to estimate. Why is that not the case here?
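One possible explanation, offered as a sketch rather than a confirmed diagnosis: the `fit` method of statsmodels' OLS uses `method="pinv"` by default, and a Moore-Penrose pseudoinverse always returns a coefficient vector, even for a singular design matrix. Among the infinitely many least-squares solutions it picks the one with the smallest norm. The NumPy-only example below rebuilds the 11 rows printed in the question (the summary reports only 10 observations, so the numbers are not expected to match the output shown) and checks the rank and the minimum-norm property.

```python
import numpy as np

# Data as printed in the question: c is constant, and d1 + d2 == c in every row.
ib = np.array([1.14, 1.0, 0.71, 0.6, 0.66, 1.0, 1.26, 1.29, 1.52, 1.31, 0.89])
d1 = np.array([1.0] * 10 + [0.0])
d2 = 1.0 - d1
c = np.ones_like(d1)
X = np.column_stack([c, d1, d2])

# Three columns but only rank 2: the design matrix is singular.
rank = np.linalg.matrix_rank(X)

# The pseudoinverse still yields coefficients (the minimum-norm solution).
beta = np.linalg.pinv(X) @ ib

# The fitted group means are identified even though the coefficients are not.
mean_d1 = ib[d1 == 1].mean()
mean_d2 = ib[d2 == 1].mean()
print(rank, beta, beta[0] + beta[1], beta[0] + beta[2])
```

Only the sums `c + d1` and `c + d2` are identified here; the individual coefficients depend on the solver's choice among equivalent solutions, which is what the near-zero smallest eigenvalue in the summary notes is warning about.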

Which post-hoc test after Welch ANOVA?

I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I ran a Welch ANOVA, which was also significant. I then tried the Games-Howell post-hoc test, but it didn't work.
Can anybody help by sending me the exact functions I have to run in R to do the Games-Howell post-hoc test and to get a kind of compact letter display that shows which treatments are not significantly different from each other? I would also like to know whether I ran the Welch ANOVA correctly (the R output is below).
Here is the output of the statistical evaluation so far:
'data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  5  3.3491 0.01956 *
      24
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> oneway.test(TSO2 ~ Dauer, data=TSO2R, var.equal = FALSE) ### Welch-ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!

How to view skipped records in pandas read_csv()? [duplicate]

I have a list of rows to skip (say [1,5,10], given as row numbers), and when I pass it to pandas read_csv, those rows are ignored. However, I need to save the skipped rows to a separate text file.
I went through the pandas read_csv documentation and a few other articles, but I have no idea how to save these rows to a text file.
Example :
Input file :
a,b,c
# Some Junk to Skip 1
4,5,6
# Some junk to skip 2
9,20,9
2,3,4
5,6,7
Code :
skiprows = [1,3]
df = pandas.read_csv(file, skiprows=skiprows)
Expected output.txt :
# Some Junk to Skip 1
# Some junk to skip 2
Thanks in advance!
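def write_skiprows(infile, skiprows, outfile='skiprows.csv'):
    # Copy the lines whose 0-based index is in skiprows from infile to outfile
    maxrow = max(skiprows)
    with open(infile, 'r') as f, open(outfile, 'w') as o:
        for i, line in enumerate(f):
            if i in skiprows:
                o.write(line)
            if i == maxrow:
                # No skipped rows remain past this point, so stop reading
                return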
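The same idea can be sketched without touching the disk, which makes it easy to try on the sample input from the question. `split_skipped` is a hypothetical helper name, and the row numbers are assumed to be 0-based line indices, as in the function above.

```python
import io


def split_skipped(lines, skiprows):
    """Return (kept, skipped) lines; skiprows holds 0-based line indices."""
    skip = set(skiprows)
    kept, skipped = [], []
    for i, line in enumerate(lines):
        (skipped if i in skip else kept).append(line)
    return kept, skipped


# Sample input from the question, held in memory instead of a file.
SAMPLE = """a,b,c
# Some Junk to Skip 1
4,5,6
# Some junk to skip 2
9,20,9
2,3,4
5,6,7
"""

kept, skipped = split_skipped(io.StringIO(SAMPLE), [1, 3])
print("".join(skipped), end="")
```

The joined `skipped` lines can be written to output.txt, and `io.StringIO("".join(kept))` can be passed to `pandas.read_csv` in place of a file name, so the file is read only once.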
Try this:
df = pd.read_csv('input.csv')
skiprows = [1, 3, 6]
# Positions here refer to DataFrame rows (after the header), not raw file lines
df, df_skiprow = df.drop(skiprows), df.iloc[skiprows]
#df_skiprow.to_csv('skiprows.csv', index=False)
Input:
a b
0 1 c1
1 2 c2
2 3 c3
3 4 c4
4 5 c5
5 6 c6
6 7 c7
7 8 c8
8 9 c9
9 10 c10
Output:
df
a b
0 1 c1
2 3 c3
4 5 c5
5 6 c6
7 8 c8
8 9 c9
9 10 c10
df_skiprow
a b
1 2 c2
3 4 c4
6 7 c7
Explanation:
Read the whole file.
Split the DataFrame into df (rows to keep) and df_skiprow (rows to skip).
Write the skipped rows to a separate CSV file.

Print HTML from raw print tables in pandas/ jupyter

Using the code from:
Pandas: cannot import name adjoin
I get the printout below. Can I easily change the output into an HTML layout?
import numpy as np
import pandas as pd
def side_by_side(*objs, **kwds):
    # Print the text representations of several objects next to each other
    from pandas.io.formats.printing import adjoin
    space = kwds.get('space', 6)
    reprs = [repr(obj).split('\n') for obj in objs]
    print(adjoin(space, *reprs))
df1 = pd.DataFrame(np.random.rand(10,3))
df2 = pd.DataFrame(np.random.rand(10,3))
side_by_side(df1, df2)
0 1 2 0 1 2
0 0.786732 0.688221 0.339926 0 0.624153 0.611812 0.933379
1 0.444541 0.366336 0.840466 1 0.734519 0.824821 0.335849
2 0.328322 0.322575 0.935291 2 0.907465 0.185209 0.407982
3 0.919987 0.968674 0.807549 3 0.737452 0.333456 0.886134
4 0.086916 0.090911 0.557082 4 0.860656 0.165118 0.230746
5 0.856184 0.884198 0.636849 5 0.052435 0.858721 0.339225
6 0.955805 0.151886 0.221581 6 0.393247 0.270365 0.123228
7 0.332495 0.256805 0.312205 7 0.456939 0.234717 0.563153
8 0.118446 0.375340 0.029774 8 0.202765 0.511387 0.948326
9 0.537782 0.945828 0.445125 9 0.371834 0.954219 0.057206
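pandas now provides a to_html() method on DataFrames, which returns the table as an HTML string.
You can use it like this:
df.to_html()
See the official pandas documentation for DataFrame.to_html.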
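To keep the side-by-side layout from the question, the HTML of each frame can be wrapped in an inline-block container. This is a minimal sketch: `side_by_side_html` and its `space_px` parameter are hypothetical names, and the inline CSS is just one of several ways to place tables in a row.

```python
import numpy as np
import pandas as pd


def side_by_side_html(*frames, space_px=24):
    """Hypothetical helper: render each frame with to_html() and line them up."""
    cells = [
        '<div style="display:inline-block; vertical-align:top; '
        f'margin-right:{space_px}px">{frame.to_html()}</div>'
        for frame in frames
    ]
    return "".join(cells)


df1 = pd.DataFrame(np.random.rand(10, 3))
df2 = pd.DataFrame(np.random.rand(10, 3))
html = side_by_side_html(df1, df2)
```

In a Jupyter notebook the resulting string can be shown with `display(HTML(html))` after `from IPython.display import HTML, display`.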

Select records based on the specific index string value and then remove subsequent fields by python

I have a .csv file named fileO.csv that contains many records. Some records are required and some are not. The required records contain the string “Mi” in one of their fields, while the unnecessary records do not. So I want to select the required records based on the presence of “Mi” in any field of a record.
Finally, for each selected record, I want to delete the field that contains “Mi” and all fields after it. Any suggestion or advice is appreciated.
Optional:
In addition, I want to delete the first column.
Split column BB into two columns named a_id and b_id. The value is split at the underscore (_): the left part goes to a_id and the right part goes to b_id.
My fileO.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My expected result file (outFile.csv):
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
The following approach should work fine using Python's csv module:
import csv
import re
import string
output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
# Python 2 translation tables: an identity table and the set of all non-digit characters
sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)
def find_mi(row):
    # Return the index of the first cell containing 'Mi', or -1 if there is none
    for index, col in enumerate(row):
        if col.find('Mi') != -1:
            return index
    return -1
def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table) # Keep digits
f_input = open('fileO.csv', 'rb')
f_output = open('outFile.csv', 'wb')
csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)
input_header = next(csv_input)
csv_output.writerow(output_header)
for row in csv_input:
    #print '%2d %s' % (len(row), row)
    if len(row) >= 2:
        # Match file names such as 1_1.csv in column BB
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        mi = find_mi(row)
        if bb and mi != -1:
            # Blank out the 'Mi' cell and everything after it
            row[:] = row[:mi] + [''] * (len(row) - mi)
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
f_input.close()
f_output.close()
outFile.csv will contain the following:
a_id,b_id,CC,DD,EE,FF,GG
1,1,0,10,27,57,
1,3,0,10,27,,
1,6,0,10,,,
7,9,0,26,46,,
Tested using Python 2.6.6
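Since the answer above relies on the Python 2 form of `string.maketrans`, a Python 3 rendering of the same approach might look like the sketch below. It is not the original author's code: `keep_before_mi` is a hypothetical name, the input is assumed to be comma-separated, quotes are treated as ordinary characters, and trailing empty cells are dropped rather than padded.

```python
import csv
import io
import re

# Assumed comma-separated rendering of a few rows from the question.
SAMPLE = '''AA,BB,CC,DD,EE,FF,GG
1,1_1.csv,(=0,=10",27",=57,"Mi"
0.97,0.9,0.8,NaN,0.9,od,0.2
2,1_3.csv,(=0,=10",27","Mi",0.5
4,2_6.csv,No,Bi,000,000,000,000
7,7_9.csv,s(=0,=26",=46","Mi",121
'''


def keep_before_mi(rows):
    """Keep rows that contain 'Mi'; split BB and keep digit-only cells before 'Mi'."""
    out = []
    for row in rows:
        if len(row) < 2:
            continue
        name = re.match(r"(\d+)_(\d+)\.csv$", row[1])
        mi = next((i for i, cell in enumerate(row) if "Mi" in cell), -1)
        if name is None or mi == -1:
            continue
        # Strip every non-digit character from the cells between BB and 'Mi'.
        cells = [re.sub(r"\D", "", cell) for cell in row[2:mi]]
        out.append([name.group(1), name.group(2)] + cells)
    return out


# QUOTE_NONE keeps stray double quotes as plain characters.
reader = csv.reader(io.StringIO(SAMPLE), quoting=csv.QUOTE_NONE)
next(reader)  # skip the header row
result = keep_before_mi(reader)
for line in result:
    print(",".join(line))
```

Replacing the in-memory sample with an opened file (using `newline=''`) and writing `result` through `csv.writer` would reproduce the file-based workflow of the answer.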