Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly collinear with the constant c (d1 + d2 = c in every row). Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 together is the well-known dummy variable trap, which, as I understand it, should make it impossible to estimate the model. Why is this not the case here?
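One way to see what is happening, offered as a sketch rather than a definitive account: the trap makes the coefficients non-unique, it does not make a fit impossible. Because c = d1 + d2 in every row, X'X is singular, so the normal equations have infinitely many solutions that all give the same fitted values. statsmodels' OLS.fit uses method='pinv' (a pseudoinverse) by default, which returns one of those solutions instead of raising an error; the notes at the bottom of the summary flag the singular design. The pure-Python check below uses the design matrix as printed in the table above; the helper names (gram, det3, predict) and the shift value are illustrative assumptions.

```python
# Design matrix as printed in the question: columns c, d1, d2.
# Every row satisfies c = d1 + d2, which is the dummy variable trap.
X = [[1, 1, 0]] * 10 + [[1, 0, 1]]


def gram(rows):
    """Return X'X for a list of equal-length rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def predict(rows, beta):
    """Fitted values X @ beta."""
    return [sum(x * b for x, b in zip(r, beta)) for r in rows]


# X'X is singular, so the normal equations have no unique solution.
xtx_det = det3(gram(X))

# Coefficients reported in the summary above.
beta_a = [0.7767, 0.2433, 0.5333]

# Moving along the null-space direction (1, -1, -1) changes the coefficients
# but not the fitted values. The shift size is arbitrary (an assumption).
shift = 0.5
beta_b = [beta_a[0] + shift, beta_a[1] - shift, beta_a[2] - shift]

max_diff = max(abs(a - b) for a, b in zip(predict(X, beta_a), predict(X, beta_b)))
print("det(X'X) =", xtx_det)
print("largest difference in fitted values:", max_diff)
```

The reported coefficients also satisfy c ≈ d1 + d2 (0.7767 ≈ 0.2433 + 0.5333), which is consistent with a minimum-norm solution. The individual coefficients and their standard errors should therefore not be interpreted on their own; only combinations such as c + d1 and c + d2 are identified.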
I have designed a multi-task network in which the first layers are shared between two output layers. While reading about multi-task learning, I learned that there should be a scalar weight, such as alpha, that balances the two losses produced by the two output layers. My question is about this parameter itself. Does it affect the model's final performance? Probably yes.
This is the part of my code that computes the losses:
...
mtl_loss = (alpha) * loss_1 + (1-alpha) * loss_2
mtl_loss.backward()
...
Above, loss_1 is MSELoss and loss_2 is CrossEntropyLoss. With alpha=0.9, I get the following loss values during training:
[2020-05-03 04:46:55,398 INFO] Step 50/150000; loss_1: 0.90 + loss_2: 1.48 = mtl_loss: 2.43 (RMSE: 2.03, F1score: 0.07); lr: 0.0000001; 29 docs/s; 28 sec
[2020-05-03 04:47:23,238 INFO] Step 100/150000; loss_1: 0.40 + loss_2: 1.27 = mtl_loss: 1.72 (RMSE: 1.38, F1score: 0.07); lr: 0.0000002; 29 docs/s; 56 sec
[2020-05-03 04:47:51,117 INFO] Step 150/150000; loss_1: 0.12 + loss_2: 1.19 = mtl_loss: 1.37 (RMSE: 0.81, F1score: 0.08); lr: 0.0000003; 29 docs/s; 84 sec
[2020-05-03 04:48:19,034 INFO] Step 200/150000; loss_1: 0.04 + loss_2: 1.10 = mtl_loss: 1.20 (RMSE: 0.55, F1score: 0.07); lr: 0.0000004; 29 docs/s; 112 sec
[2020-05-03 04:48:46,927 INFO] Step 250/150000; loss_1: 0.02 + loss_2: 0.96 = mtl_loss: 1.03 (RMSE: 0.46, F1score: 0.08); lr: 0.0000005; 29 docs/s; 140 sec
[2020-05-03 04:49:14,851 INFO] Step 300/150000; loss_1: 0.02 + loss_2: 0.99 = mtl_loss: 1.05 (RMSE: 0.43, F1score: 0.08); lr: 0.0000006; 29 docs/s; 167 sec
[2020-05-03 04:49:42,793 INFO] Step 350/150000; loss_1: 0.02 + loss_2: 0.97 = mtl_loss: 1.04 (RMSE: 0.43, F1score: 0.08); lr: 0.0000007; 29 docs/s; 195 sec
[2020-05-03 04:50:10,821 INFO] Step 400/150000; loss_1: 0.01 + loss_2: 0.94 = mtl_loss: 1.00 (RMSE: 0.41, F1score: 0.08); lr: 0.0000008; 29 docs/s; 223 sec
[2020-05-03 04:50:38,943 INFO] Step 450/150000; loss_1: 0.01 + loss_2: 0.86 = mtl_loss: 0.92 (RMSE: 0.40, F1score: 0.08); lr: 0.0000009; 29 docs/s; 252 sec
As the training loss shows, the first network, which uses MSELoss, converges very quickly, while the second network has not converged yet. RMSE and F1score are the two metrics I use to track the progress of the first and second network, respectively.
I know that picking the optimal alpha is somewhat experimental, but are there hints that make picking it easier? Specifically, I want the two networks to train in step with each other, unlike above, where the first network converges far sooner than the second. Can the alpha parameter help control this?
With that alpha, loss_1 contributes more to the total loss, and because backpropagation updates the weights in proportion to the error, the first task improves faster. Try a more balanced alpha to even out the performance on both tasks.
You can also try changing alpha during training.
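As a rough illustration of changing alpha during training, here is a minimal sketch that moves alpha linearly from a starting value to a more balanced one. The function names (alpha_schedule, combine_losses), the default values 0.9 and 0.5, and the choice of a linear schedule are all assumptions made for illustration; they are not taken from the question's code.

```python
def alpha_schedule(step, total_steps, start=0.9, end=0.5):
    """Linearly interpolate alpha from `start` to `end` over `total_steps`.

    Steps outside [0, total_steps] are clamped to the nearest endpoint.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def combine_losses(loss_1, loss_2, alpha):
    """Weighted sum used in the question: alpha * loss_1 + (1 - alpha) * loss_2."""
    return alpha * loss_1 + (1 - alpha) * loss_2


# Example: how the weight on loss_1 shrinks over a hypothetical 150000-step run.
for step in (0, 75000, 150000):
    print(step, round(alpha_schedule(step, 150000), 3))
```

Inside a training loop, the idea would be to call alpha_schedule with the current step before computing mtl_loss, so that the slower task gradually receives more weight.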
I'm using MySQL to calculate returns for my portfolio. I have a table of portfolios; the holding period is, say, 6 months:
table Portfolio
DATE_ TICKER WEIGHT
2007-01-31 AAPL 0.2
2007-01-31 IBM 0.2
2007-01-31 FB 0.3
2007-01-31 MMM 0.3
2007-07-31 AAPL 0.1
2007-07-31 FB 0.8
2007-07-31 AMD 0.1
... ... ...
I also have a monthly stats table for these companies (the whole universe of stocks), including monthly returns:
table stats
DATE_ TICKER RETURN OTHER_STATS
2007-01-31 AAPL 0.01 ...
2007-01-31 IBM 0.03 ...
2007-01-31 FB 0.13 ...
2007-01-31 MMM -0.07 ...
2007-02-31 AAPL 0.03 ...
2007-02-31 IBM 0.04 ...
2007-02-31 FB 0.06 ...
2007-02-31 MMM -0.10 ...
I rebalance the portfolio every 6 months, so during those 6 months the weight of each stock doesn't change. What I want to get is something like this:
ResultTable
DATE_ TICKER RETURN OTHER_STATS WEIGHT
2007-01-31 AAPL 0.01 ... 0.2
2007-01-31 IBM 0.03 ... 0.2
2007-01-31 FB 0.13 ... 0.3
2007-01-31 MMM -0.07 ... 0.3
2007-02-31 AAPL 0.03 ... 0.2
2007-02-31 IBM 0.04 ... 0.2
2007-02-31 FB 0.06 ... 0.3
2007-02-31 MMM -0.10 ... 0.3
2007-03-31 AAPL 0.03 ... 0.2
2007-03-31 IBM 0.14 ... 0.2
2007-03-31 FB 0.16 ... 0.3
2007-03-31 MMM -0.06 ... 0.3
... ... ... ... ...
2007-07-31 AAPL ... ... 0.1
2007-07-31 FB ... ... 0.8
2007-07-31 AMD ... ... 0.1
2007-08-31 AAPL ... ... 0.1
2007-08-31 FB ... ... 0.8
2007-08-31 AMD ... ... 0.1
I tried
select s.*, p.WEIGHT from portfolio p
left join stats s
on p.DATE_ = s.DATE_
and p.TICKER= s.TICKER;
It only gives me rows for the portfolio rebalance dates.
Is there any efficient way to calculate the monthly returns?
This might work, if I understand your formula:
SELECT
p.`DATE_`,
p.`TICKER`,
SUM(s.`RETURN` * p.`WEIGHT`) as `return`,
p.WEIGHT
FROM `portfolio` p
LEFT JOIN `stats` s
ON p.`TICKER` = s.`TICKER`
WHERE s.`DATE_` BETWEEN p.`DATE_` AND DATE_ADD(DATE_ADD(p.`DATE_`, INTERVAL 6 MONTH),INTERVAL -1 DAY)
GROUP BY p.`DATE_`, p.`TICKER`
ORDER BY p.`DATE_`, p.`TICKER`;
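The query above aggregates returns per rebalance period. If the goal is one row per month with the weight carried forward, as in ResultTable, another option is to join each stats row to the most recent rebalance date at or before it. The sketch below illustrates that join with Python's built-in sqlite3 standing in for MySQL, so the syntax may need adjusting; it assumes dates are stored as ISO text (so that MAX and <= compare correctly), and the function name monthly_weights and the sample rows are hypothetical.

```python
import sqlite3


def monthly_weights(portfolio_rows, stats_rows):
    """Attach to each stats row the weight from the latest rebalance on or before it."""
    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE portfolio (DATE_ TEXT, TICKER TEXT, WEIGHT REAL)')
    con.execute('CREATE TABLE stats (DATE_ TEXT, TICKER TEXT, "RETURN" REAL)')
    con.executemany("INSERT INTO portfolio VALUES (?, ?, ?)", portfolio_rows)
    con.executemany("INSERT INTO stats VALUES (?, ?, ?)", stats_rows)
    query = '''
        SELECT s.DATE_, s.TICKER, s."RETURN", p.WEIGHT
        FROM stats s
        JOIN portfolio p
          ON p.TICKER = s.TICKER
         AND p.DATE_ = (SELECT MAX(p2.DATE_)
                        FROM portfolio p2
                        WHERE p2.DATE_ <= s.DATE_)
        ORDER BY s.DATE_, s.TICKER
    '''
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# Hypothetical sample data: two rebalance dates, three monthly stats dates.
portfolio = [
    ("2007-01-31", "AAPL", 0.2), ("2007-01-31", "IBM", 0.2),
    ("2007-07-31", "AAPL", 0.1), ("2007-07-31", "AMD", 0.1),
]
stats = [
    ("2007-01-31", "AAPL", 0.01), ("2007-02-28", "AAPL", 0.03),
    ("2007-02-28", "IBM", 0.04), ("2007-07-31", "AAPL", 0.02),
    ("2007-07-31", "IBM", 0.05),
]
result = monthly_weights(portfolio, stats)
for row in result:
    print(row)
```

Stocks that are not part of the portfolio at the applicable rebalance date (IBM on 2007-07-31 in this sample) drop out of the result because of the inner join.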
I have multiple .csv files and I want to concatenate them into one file. Essentially I would like to choose certain columns and append them side by side.
This code I have here doesn't work. No error message at all. It just does nothing.
Does anybody know how to fix it?
import pandas as pd
import datetime
import numpy as np
import glob
import csv
import os
def concatenate(indir='/My Documents/Python/Test/in',
                outfile='/My Documents/Python/Test/out/Forecast.csv'):
    os.chdir(indir)
    fileList = glob.glob('*.csv')
    print(fileList)
    dfList = []
    colnames=["DateTime","WindSpeed","Capacity","p0.025","p0.05","p0.1","p0.5","p0.9","p0.95","p0.975","suffix"]
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename ,delimiter=',',engine = 'python', encoding='latin-1', index_col = False)
        dfList.append(df)
    concatDF = pd.concat(dfList,axis=0)
    concatDF.columns=colnames
    concatDF.to_csv(outfile,index=None)
I ran this code to set up files on my file system
setup
import pandas as pd
import numpy as np
def setup_test_files(indir='in'):
    colnames = [
        "WindSpeed", "Capacity",
        "p0.025", "p0.05", "p0.1", "p0.5",
        "p0.9", "p0.95", "p0.975", "suffix"
    ]
    tidx = pd.date_range('2016-03-31', periods=3, freq='M', name='DateTime')
    # write three files of random data into indir (the directory must already exist)
    for filename in ['{}/fn_{}.csv'.format(indir, i) for i in range(3)]:
        pd.DataFrame(
            np.random.rand(3, len(colnames)),
            tidx, colnames
        ).round(2).to_csv(filename)
        print(filename)
setup_test_files()
This created 3 files named ['fn_0.csv', 'fn_1.csv', 'fn_2.csv']
They look like this
with open('in/fn_0.csv', 'r') as fo:
    print(''.join(fo.readlines()))
DateTime,WindSpeed,Capacity,p0.025,p0.05,p0.1,p0.5,p0.9,p0.95,p0.975,suffix
2016-03-31,0.03,0.76,0.62,0.21,0.76,0.36,0.44,0.61,0.23,0.04
2016-04-30,0.39,0.12,0.31,0.99,0.86,0.35,0.15,0.61,0.55,0.03
2016-05-31,0.72,1.0,0.71,0.86,0.41,0.79,0.22,0.76,0.92,0.79
I'll define a parser function and one that does the concatenation separately. Why? Because I think it's easier to follow that way.
import pandas as pd
import glob
import os
def read_csv(fn):
    colnames = [
        "DateTime", "WindSpeed", "Capacity",
        "p0.025", "p0.05", "p0.1", "p0.5",
        "p0.9", "p0.95", "p0.975", "suffix"
    ]
    df = pd.read_csv(fn, encoding='latin-1')
    df.columns = colnames
    return df
def concatenate(indir='in', outfile='out/Forecast.csv'):
    curdir = os.getcwd()
    try:
        os.chdir(indir)
        file_list = glob.glob('*.csv')
        df_names = [fn.replace('.csv', '') for fn in file_list]
        # axis=1 places the files side by side; keys labels each block by file name
        concat_df = pd.concat(
            [read_csv(fn) for fn in file_list],
            axis=1, keys=df_names)
    finally:
        # notice I was nice enough to change directory back :-)
        # (finally runs on errors too, and the error is still reported)
        os.chdir(curdir)
    concat_df.to_csv(outfile, index=None)
Then run concatenation
concatenate()
You can read in the results like this
print(pd.read_csv('out/Forecast.csv', header=[0, 1]))
fn_0 \
DateTime WindSpeed Capacity p0.025 p0.05 p0.1 p0.5 p0.9 p0.95 p0.975
0 2016-03-31 0.03 0.76 0.62 0.21 0.76 0.36 0.44 0.61 0.23
1 2016-04-30 0.39 0.12 0.31 0.99 0.86 0.35 0.15 0.61 0.55
2 2016-05-31 0.72 1.00 0.71 0.86 0.41 0.79 0.22 0.76 0.92
... fn_2
... WindSpeed Capacity p0.025 p0.05 p0.1 p0.5 p0.9 p0.95 p0.975 suffix
0 ... 0.80 0.79 0.38 0.94 0.91 0.18 0.27 0.14 0.39 0.91
1 ... 0.60 0.97 0.04 0.69 0.04 0.65 0.94 0.81 0.37 0.22
2 ... 0.78 0.53 0.83 0.93 0.92 0.12 0.15 0.65 0.06 0.11
[3 rows x 33 columns]
Notes:
You aren't taking care to make DateTime your index. I think this is probably what you want. If so, change the read_csv and concatenate functions to this
import pandas as pd
import glob
import os
def read_csv(fn):
    colnames = [
        "WindSpeed", "Capacity",
        "p0.025", "p0.05", "p0.1", "p0.5",
        "p0.9", "p0.95", "p0.975", "suffix"
    ]
    # notice extra parameters for specifying index and parsing dates
    df = pd.read_csv(fn, index_col=0, parse_dates=[0], encoding='latin-1')
    df.index.name = "DateTime"
    df.columns = colnames
    return df
def concatenate(indir='in', outfile='out/Forecast.csv'):
    curdir = os.getcwd()
    try:
        os.chdir(indir)
        file_list = glob.glob('*.csv')
        df_names = [fn.replace('.csv', '') for fn in file_list]
        concat_df = pd.concat(
            [read_csv(fn) for fn in file_list],
            axis=1, keys=df_names)
    finally:
        # change directory back even if reading fails; the error is still reported
        os.chdir(curdir)
    # keep the DateTime index in the output file
    concat_df.to_csv(outfile)
This is what the final result looks like with this change. Notice that the dates are now aligned:
fn_0 \
WindSpeed Capacity p0.025 p0.05 p0.1 p0.5 p0.9 p0.95 p0.975
DateTime
2016-03-31 0.03 0.76 0.62 0.21 0.76 0.36 0.44 0.61 0.23
2016-04-30 0.39 0.12 0.31 0.99 0.86 0.35 0.15 0.61 0.55
2016-05-31 0.72 1.00 0.71 0.86 0.41 0.79 0.22 0.76 0.92
... fn_2 \
suffix ... WindSpeed Capacity p0.025 p0.05 p0.1 p0.5 p0.9
DateTime ...
2016-03-31 0.04 ... 0.80 0.79 0.38 0.94 0.91 0.18 0.27
2016-04-30 0.03 ... 0.60 0.97 0.04 0.69 0.04 0.65 0.94
2016-05-31 0.79 ... 0.78 0.53 0.83 0.93 0.92 0.12 0.15
p0.95 p0.975 suffix
DateTime
2016-03-31 0.14 0.39 0.91
2016-04-30 0.81 0.37 0.22
2016-05-31 0.65 0.06 0.11
[3 rows x 30 columns]
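To make the point about date alignment concrete, here is a small in-memory sketch. It assumes two hypothetical frames whose dates only partly overlap; the values and the names a, b and combined are made up for illustration. With DateTime as the index, pd.concat(..., axis=1) lines rows up by date and fills the gaps with NaN instead of pairing rows by position.

```python
import pandas as pd

# Two hypothetical frames with partly overlapping dates.
a = pd.DataFrame(
    {"WindSpeed": [0.03, 0.39]},
    index=pd.to_datetime(["2016-03-31", "2016-04-30"]),
)
b = pd.DataFrame(
    {"WindSpeed": [0.60, 0.78]},
    index=pd.to_datetime(["2016-04-30", "2016-05-31"]),
)
a.index.name = "DateTime"
b.index.name = "DateTime"

# Side-by-side concatenation; keys adds a top column level naming each source.
combined = pd.concat([a, b], axis=1, keys=["fn_0", "fn_1"])
print(combined)
```

Only 2016-04-30 has values in both blocks here; the other two dates each have one missing value.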
I have a few hundred CSV files that I want to insert into a MySQL database with an R script.
library(RMySQL)
library(caroline)
...
...
con <- dbConnect(MySQL(), user="root", password="mypass", dbname="Data", host="localhost")
fields <- dbListFields(con, db.table.name)
dbWriteTable2(con,db.table.name,data,row.names=FALSE,fill.null=TRUE)
dbDisconnect(con)
...
...
R responds with:
Error in mysqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: Unknown column 'id' in 'field list')
But there is no id field in either of the two tables, neither my data frame nor the MySQL table.
So what is R trying to tell me?
Here is my database schema.
And here is the str() of my data:
str(data)
'data.frame': 306 obs. of 22 variables:
$ Division : chr "B1" "B1" "B1" "B1" ...
$ Eventdate: Date, format: "2000-08-12" "2000-08-12" "2000-08-12" "2000-08-12" ...
$ HomeTeam : chr "Beveren" "Mechelen" "Louvieroise" "Mouscron" ...
$ AwayTeam : chr "Charleroi" "Genk" "Lokeren" "Germinal" ...
$ FTHG : int 1 0 0 3 2 1 6 3 0 2 ...
$ FTAG : int 2 0 0 0 1 3 2 4 0 1 ...
$ FTR : chr "A" "D" "D" "H" ...
$ HTHG : int 1 0 0 1 1 0 3 1 0 1 ...
$ HTAG : int 1 0 0 0 0 1 0 2 0 0 ...
$ HTR : chr "D" "D" "D" "H" ...
$ GBH : num 2.2 2 2.3 1.9 1.41 2.7 1.8 2.6 4 1.5 ...
$ GBD : num 3.5 3.6 3.3 3.3 4.1 3.3 3.4 3.2 3 3.5 ...
$ GBA : num 2.5 2.65 2.45 3.5 4.7 2.1 3.5 2.2 1.75 5.1 ...
$ IWH : num 2 2.1 2.35 1.8 1.75 2.3 1.6 2.6 3.8 1.4 ...
$ IWD : num 3 3 3 3 3.1 3 3.2 3 3.2 3.6 ...
$ IWA : num 2.9 2.8 2.35 3.5 3.6 2.4 4.2 2.2 1.65 5.5 ...
$ SBH : num 2.05 2.3 2.45 1.85 1.6 2.6 1.6 2.75 4.1 1.53 ...
$ SBD : num 3.4 3.4 3.4 3.4 3.5 3.2 3.5 3.25 3.45 3.5 ...
$ SBA : num 2.95 2.55 2.4 3.5 4.5 2.3 4.5 2.2 1.7 5 ...
$ WHH : num 2 2.4 2.4 -1 1.61 2.4 1.61 2.6 3.8 1.5 ...
$ WHD : num 3.4 3.3 3.3 -1 3.6 3.3 3.75 3.3 3.5 3.6 ...
$ WHA : num 2.9 2.4 2.4 -1 4.2 2.4 4 2.25 1.7 5.2 ...
Based on some testing of the caroline package, dbWriteTable2 is only useful for tables that have an id column. According to ?dbWriteTable2, it returns: "If successful, the ids of the newly added database records (invisible)." If the table does not have an id column to return, dbWriteTable2 seems to fail.