Custom datetimeparsing to combine date and time after reading csv - Pandas - csv

upon reading text file, I am presented with an odd format, where date and time are contained in separate columns, as follows (files is tabs as separators).
temp
room 1
Date Time simulation
Fri, 01/Jan 00:30 11.94
01:30 12
02:30 12.04
03:30 12.06
04:30 12.08
05:30 12.09
06:30 11.99
07:30 12.01
08:30 12.29
09:30 12.46
10:30 12.35
11:30 12.25
12:30 12.19
13:30 12.12
14:30 12.04
15:30 11.96
16:30 11.9
17:30 11.92
18:30 11.87
19:30 11.79
20:30 12
21:30 12.16
22:30 12.27
23:30 12.3
Sat, 02/Jan 00:30 12.25
01:30 12.19
02:30 12.14
03:30 12.11
etc.
I would like to:
parse date and time over two columns ([0],[1]);
shift all timestamps 30minutes early, that is replacing :30 with :00;
I have used the following code:
timeparse = lambda x: pd.datetime.strptime(x.replace(':30',':00'), '%H:%M')
df = pd.read_csv('Chart_1.txt',
sep='\t',
skiprows=1,
date_parser=timeparse,
parse_dates=['Time'],
header=1)
Which does seem to be parsing time not dates (obviously, as this is what I told it to do).
Also, skipping rows is useful for finding the Date and Time headers, but it discards the headers temp and room 1, that I need.

You can use:
import pandas as pd
df = pd.read_csv('Chart_1.txt', sep='\t')
#get temperature to variable tempfrom third column
temp = df.columns[2]
print (temp)
Dry resultant temperature (°C)
#get aps to variable aps from second row and third column
aps = df.iloc[1, 2]
print (aps)
AE4854c_Campshill_openings reduced_communal areas increased openings2.aps
#create mask from first column - all values contains / - dates
mask = df.iloc[:, 0].str.contains('/',na=False)
#shift all rows to right NOT contain dates
df1 = df[~mask].shift(1, axis=1)
#get rows with dates
df2 = df[mask]
#concat df1 and df2, sort unsorted indexes
df = pd.concat([df1, df2]).sort_index()
#create new column names by assign
#first 3 are custom, other are from first row and fourth to end columns
df.columns = ['date','time','no name'] + df.iloc[0, 3:].tolist()
#remove first 2 row
df = df[2:]
#fill NaN values in column date by forward filling
df.date = df.date.ffill()
#convert column to datetime
df.date = pd.to_datetime(df.date, format='%a, %d/%b')
#replace 30 minutes to 00
df.time = df.time.str.replace(':30', ':00')
print (df.head())
date time no name 3F_T09_SE_SW_Bed1 GF_office_S GF_office_W_tea \
2 1900-01-01 00:00 11.94 11.47 14.72 16.66
3 1900-01-01 01:00 12.00 11.63 14.83 16.69
4 1900-01-01 02:00 12.04 11.73 14.85 16.68
5 1900-01-01 03:00 12.06 11.80 14.83 16.65
6 1900-01-01 04:00 12.08 11.84 14.79 16.62
GF_Act.Room GF_Communal areas GF_Reception GF_Ent Lobby ... \
2 17.41 12.74 12.93 10.85 ...
3 17.45 12.74 13.14 11.00 ...
4 17.44 12.71 13.23 11.09 ...
5 17.41 12.68 13.27 11.16 ...
6 17.36 12.65 13.28 11.21 ...
2F_S01_SE_SW_Bedroom 2F_S01_SE_SW_Int Circ 2F_S01_SE_SW_Storage_int circ \
2 12.58 12.17 12.54
3 12.64 12.22 12.49
4 12.68 12.27 12.48
5 12.70 12.30 12.49
6 12.71 12.31 12.51
GF_G01_SE_SW_Bedroom GF_G01_SE_SW_Storage_Bed 3F_T09_SE_SW_Bathroom \
2 14.51 14.61 11.49
3 14.55 14.59 11.50
4 14.56 14.59 11.52
5 14.55 14.58 11.54
6 14.54 14.57 11.56
3F_T09_SE_SW_Circ 3F_T09_SE_SW_Storage_int circ GF_Lounge GF_Cafe
2 11.52 11.38 12.83 12.86
3 11.56 11.35 13.03 13.03
4 11.61 11.36 13.13 13.13
5 11.65 11.39 13.17 13.17
6 11.68 11.42 13.18 13.18
[5 rows x 31 columns]

Related

statsmodels OLS gives parameters despite perfect multicollinearity

Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly colinear. Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 represents the well known dummy variable trap which, from my understanding, should make it impossible to estimate the model. Why is this not the case here?

Append information in the th tags to td rows

I am an economist struggling with coding and data scraping.
I am scarping data from the main and unique table on this webpage (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). I can retrieve all the information of the td HTML tags with python selenium by referring to the class element. The same goes for the th tag where it is stored the information of the date and stage of the competition. In my final dataset, I would like to have the information stored in the th tag in two rows (data and stage of the competition) next to the other rows in the table. Basically, for each match, I would like to have the date and the stage of the competition in rows and not as the head of each group of matches.
The only solution I came up with is to index all the rows (with both th and td tags) and build a while loop to append the information in the th tags to the td rows whose index is lower than the next index for the th tag. Hope I made myself clear (if not I will try to give a more graphical explanation). However, I am not able to code such a logic construct due to my poor coding abilities. I do not know if I need two loops to iterate through different tags (td and th) and in case how to do that. If you have any easier solution, it is more than welcome!
Thanks in advance for the precious help!
code below:
from selenium import webdriver
import time
import pandas as pd
# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']
# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
"Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
data[key] = list()
del data_keys
# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
page_num = 0
while True:
page_num += 1
# Get url and navigate it
page_str = (1 - len(str(page_num)))* '0' + str(page_num)
url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
driver.get(url)
time.sleep(3)
# Check if page has no data
if driver.find_elements_by_id("emptyMsg"):
print("Season {} ended at page {}".format(season_filt, page_num))
break
try:
# Teams
for el in driver.find_elements_by_class_name('name.table-participant'):
el = el.text.strip().split(" - ")
data["Home_Team"].append(el[0])
data["Away_Team"].append(el[1])
data["Season"].append(season_filt)
# Scores
for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
el = el.text.split(":")
if el[1][-3:] == " OT":
data["OT"].append(True)
el[1] = el[1][:-3]
else:
data["OT"].append(False)
data["Home_Score"].append(el[0])
data["Away_Score"].append(el[1])
# Match times
for el in driver.find_elements_by_class_name("table-time"):
data["Match_Time"].append(el.text)
# Odds
i = 0
for el in driver.find_elements_by_class_name("odds-nowrp"):
i += 1
if i%2 == 0:
data["Away_Odd"].append(el.text)
else:
data["Home_Odd"].append(el.text)
# N_Bookmakers
for el in driver.find_elements_by_class_name("center.info-value"):
data["N_Bookmakers"].append(el.text)
# TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)
except:
pass
driver.quit()
data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index = False)
I would like to add this information to my dataset as two additional rows:
for el in driver.find_elements_by_class_name("first2.tl")[1:]:
el = el.text.strip().split(" - ")
data["date"].append(el[0])
data["stage"].append(el[1])
Few things I would change here.
Don't overwrite variables. You store elements in your el variable, then you over write the element with your strings. It may work for you here, but you may get yourself into trouble with that practice later on, especially since you are iterating through those elements. It makes it hard to debug too.
I know Selenium has ways to parse the html. But I personally feel BeautifulSoup is a tad easier to parse with and is a little more intuitive if you are simply just trying to pull out data from the html. So I went with BeautifulSoup's .find_previous() to get the tags that precede the games, essentially then able to get your date and stage content.
Lastly, I like to construct a list of dictionaries to make up the data frame. Each item in the list is a dictionary key:value where the key is the column name and value is the data. You sort of do the opposite in creating a dictionary of lists. Now there is nothing wrong with that, but if the lists don't have the same length, you're get an error when trying to create the dataframe. Where as with my way, if for what ever reason there is a value missing, it will still create the dataframe, but will just have a null or nan for the missing data.
There may be more work you need to do with the code to go through the pages, but this gets you the data in the form you need.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re
# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016','2016-2017', '2017-2018', '2018-2019']
# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)
rows = []
# Loop through pages based on page_num and season
for season_filt in seasons_filt:
page_num = 0
while True:
page_num += 1
# Get url and navigate it
page_str = (1 - len(str(page_num)))* '0' + str(page_num)
url ="https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
driver.get(url)
time.sleep(3)
# Check if page has no data
if driver.find_elements_by_id("emptyMsg"):
print("Season {} ended at page {}".format(season_filt, page_num))
break
try:
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find('table', {'id':'tournamentTable'})
trs = table.find_all('tr', {'class':re.compile('.*deactivate.*')})
for each in trs:
teams = each.find('td', {'class':'name table-participant'}).text.split(' - ')
scores = each.find('td', {'class':re.compile('.*table-score.*')}).text.split(':')
ot = False
for score in scores:
if 'OT' in score:
ot == True
scores = [x.replace('\xa0OT','') for x in scores]
matchTime = each.find('td', {'class':re.compile('.*table-time.*')}).text
# Odds
i = 0
for each_odd in each.find_all('td',{'class':"odds-nowrp"}):
i += 1
if i%2 == 0:
away_odd = each_odd.text
else:
home_odd = each_odd.text
n_bookmakers = soup.find('td',{'class':'center info-value'}).text
date_stage = each.find_previous('th', {'class':'first2 tl'}).text.split(' - ')
date = date_stage[0]
stage = date_stage[1]
row = {'Season':season_filt,
'Home_Team':teams[0],
'Away_Team':teams[1],
'Home_Score':scores[0],
'Away_Score':scores[1],
'OT':ot,
'Match_Time':matchTime,
'Home_Odd':home_odd,
'Away_Odd':away_odd,
'N_Bookmakers':n_bookmakers,
'Date':date,
'Stage':stage}
rows.append(row)
except:
pass
driver.quit()
data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index = False)
Output:
print(data.head(15).to_string())
Season Home_Team Away_Team Home_Score Away_Score OT Match_Time Home_Odd Away_Odd N_Bookmakers Date Stage
0 2013-2014 Real Madrid Maccabi Tel Aviv 86 98 False 18:00 -667 +493 7 18 May 2014 Final Four
1 2013-2014 Barcelona CSKA Moscow 93 78 False 15:00 -135 +112 7 18 May 2014 Final Four
2 2013-2014 Barcelona Real Madrid 62 100 False 19:00 +134 -161 7 16 May 2014 Final Four
3 2013-2014 CSKA Moscow Maccabi Tel Aviv 67 68 False 16:00 -278 +224 7 16 May 2014 Final Four
4 2013-2014 Real Madrid Olympiacos 83 69 False 18:45 -500 +374 7 25 Apr 2014 Play Offs
5 2013-2014 CSKA Moscow Panathinaikos 74 44 False 16:00 -370 +295 7 25 Apr 2014 Play Offs
6 2013-2014 Olympiacos Real Madrid 71 62 False 18:45 +127 -152 7 23 Apr 2014 Play Offs
7 2013-2014 Maccabi Tel Aviv Olimpia Milano 86 66 False 17:45 -217 +179 7 23 Apr 2014 Play Offs
8 2013-2014 Panathinaikos CSKA Moscow 73 72 False 16:30 -106 -112 7 23 Apr 2014 Play Offs
9 2013-2014 Panathinaikos CSKA Moscow 65 59 False 18:45 -125 +104 7 21 Apr 2014 Play Offs
10 2013-2014 Maccabi Tel Aviv Olimpia Milano 75 63 False 18:15 -189 +156 7 21 Apr 2014 Play Offs
11 2013-2014 Olympiacos Real Madrid 78 76 False 17:00 +104 -125 7 21 Apr 2014 Play Offs
12 2013-2014 Galatasaray Barcelona 75 78 False 17:00 +264 -333 7 20 Apr 2014 Play Offs
13 2013-2014 Olimpia Milano Maccabi Tel Aviv 91 77 False 18:45 -286 +227 7 18 Apr 2014 Play Offs
14 2013-2014 CSKA Moscow Panathinaikos 77 51 False 16:15 -303 +247 7 18 Apr 2014 Play Offs

How can I sort output from variable?

I want to be able to sort an input csv file that is comma separated by a values created in an extra column. Below is a sample of the input csv file
Timestamp,Email,Name,Year,Make,Model,Car_ID,Judge_ID,Judge_Name,Racer_Turbo,Racer_Supercharged,Racer_Performance,Racer_Horsepower,Car_Overall,Engine_Modifications,Engine_Performance,Engine_Chrome,Engine_Detailing,Engine_Cleanliness,Body_Frame_Undercarriage,Body_Frame_Suspension,Body_Frame_Chrome,Body_Frame_Detailing,Body_Frame_Cleanliness,Mods_Paint,Mods_Body,Mods_Wrap,Mods_Rims,Mods_Interior,Mods_Other,Mods_ICE,Mods_Aftermarket,Mods_WIP,Mods_Overall
8/5/2018 14:10,honoland13#japanpost.jp,Hernando,2015,Acura,TLX,48,J04,Bob,0,0,2,2,4,4,0,2,4,4,2,4,2,2,2,2,2,0,4,4,4,6,2,0,4
8/5/2018 15:11,nlighterness2q#umn.edu,Noel,2015,Jeep,Wrangler,124,J02,Carl,0,6,4,2,4,6,6,4,4,4,6,6,6,6,6,4,6,6,6,6,6,4,6,4,6
8/5/2018 17:10,eguest47#microsoft.com,Edan,2015,Lexus,Is250,222,J05,Adrian,0,0,0,0,0,0,0,0,6,6,6,0,0,6,6,6,0,0,0,0,0,0,0,0,4
8/5/2018 17:34,hchilley40#fema.gov,Hieronymus,1993,Honda,Civic eG,207,J06,Aaron,0,0,2,2,2,2,2,2,0,4,2,2,2,2,2,2,4,2,2,0,0,0,2,2,0
8/5/2018 14:30,nnowick3d#tuttocitta.it,Nickolas,2016,Ford,Mystang,167,J02,Carl,0,0,2,2,0,2,2,0,0,0,0,2,0,2,2,2,0,0,2,0,0,0,0,0,2
8/5/2018 16:12,mdearl39#amazon.co.uk,Martin,2013,Hyundai,Gen coupe,159,J04,Bob,0,0,2,0,0,0,2,0,0,0,0,2,0,2,2,0,2,0,2,0,0,0,0,0,0
8/5/2018 17:00,alynamg#blogtalkradio.com,Aldridge,2009,Infiniti,G37,20,J06,Aaron,2,0,2,2,0,0,2,0,0,2,2,2,2,2,2,2,2,2,4,2,2,0,2,0
What my code currently does is sift through the csv file, and pick out the car_id column, year, make, and model columns. Then it runs through every column from racer_turbo to the last, and for each row it adds up the values in those columns into a total value and prints that along side the other values (id, make, model, etc.). There is also a ranking column that precedes the other 5 when printed. Here is my code below.
BEGIN {
FS = ",";
OFS = "\t";
print "Ranking", "Car_ID", "Year", "Make", "Model", "Total";
}
{
rank;
total = 0;
if(NR > 1) {
for(i = 8; i < NF; i++) {
total += $i;
}
print ++rank,$7, $4, $5, $6, total;
}
rows[$5][total][$0]
}
END {
print "\n";
print "Ranking", "Car_ID", "Year", "Make", "Model", "Total";
ranking;
PROCINFO["sorted_in"] = "#ind_str_asc"
for (m in rows) {
n = asorti(rows[m], t, "#ind_num_desc");
n = (n>3) ? 3 : n
for(i = 1; i <= n; i++) for(s in rows[m][t[i]]) {
$0 = s;
$1 = ++r;
print ++ranking, $7, $4, $5, $6, total;
}
}
}
What I would like to do in the END block is print the output again, however, rank the cars by top three from each make using the total column which was created in the preceding block of the code. However, what I run my code now the output looks as follows
Ranking Car_ID Year Make Model Total
1 48 2015 Acura TLX 58
2 124 2015 Jeep Wrangler 118
3 222 2015 Lexus Is250 36
4 207 1993 Honda Civic eG 40
5 167 2016 Ford Mystang 18
6 159 2013 Hyundai Gen coupe 14
7 20 2009 Infiniti G37 36
...
Ranking Car_ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 10
2 112 2008 Acura TL 10
3 50 2015 Acura TLX 10
4 15 2014 Audi S4 10
5 18 2015 Audi S3 10
6 116 2008 Audi A4 10
7 2 2016 Bmw M2 10
8 172 2014 Bmw 4 10
9 28 1995 Bmw 318xi 10
...
See how in the total column on the second printed section it shows total is 10 for each printed car, instead of being the same values as they were in the first printed section for each respective car, and the highest 3 totals for each make being displayed.
Below is the expected output
Ranking Car_ID Year Make Model Total
1 48 2015 Acura TLX 58
2 124 2015 Jeep Wrangler 118
3 222 2015 Lexus Is250 36
4 207 1993 Honda Civic eG 40
5 167 2016 Ford Mystang 18
6 159 2013 Hyundai Gen coupe 14
7 20 2009 Infiniti G37 36
8 178 2009 Honda Oddesy 66
...
Ranking Car_ID Year Make Model Total
1 112 2008 Acura TL 110
2 50 2015 Acura TLX 102
3 127 2013 Acura Tsx 86
4 15 2014 Audi S4 120
5 18 2015 Audi S3 38
6 116 2008 Audi A4 28
7 2 2016 Bmw M2 24
8 172 2014 Bmw 4 22
9 111 2007 Bmw 328i 10
10 218 2010 Chevy Camaro 64
11 170 2014 Chevy Cruze 50
12 0 2015 Chevy Camaro 0
...
Is this salvagable with my current code? Or would a better approach be to create a separate awk file that will sort through the generated output and produce another file that is sorted by the top 3?
I'm running GNU AWK v4.0.2.
Assuming the Car_ID (hereinafter referred to as id) is unique across the rows, would you please try:
BEGIN {
FS = ","
OFS = "\t"
print "Ranking", "Car_ID", "Year", "Make", "Model", "Total"
}
{
rank
total = 0
if (NR > 1) {
for (i = 8; i < NF; i++) {
total += $i
}
print ++rank, $7, $4, $5, $6, total
ttl[$5][$7] = total
row[$7] = $0
}
}
END {
print "\n"
print "Ranking", "Car_ID", "Year", "Make", "Model", "Total"
ranking
id
PROCINFO["sorted_in"] = "#ind_str_asc"
for (m in ttl) {
n = asorti(ttl[m], t, "#val_num_desc")
n = (n>3) ? 3 : n
for (i = 1; i <= n; i++) {
id = t[i]
total = ttl[m][id]
$0 = row[id]
print ++ranking, $7, $4, $5, $6, total
}
}
}
I have slightly modified the data structure, assigning the id as the
main key. Then created a 2-D array ttl, which holds the value total
keyed by make and id. In the END loop, we can retrieve the
input data using the id.
As a side note, your original data structure uses total as an index.
If multiple rows with the same make happen to have the same value
of total, either of the indexes will be overwritten.

which post-hoc test after welch-anova

i´m doing the statistical evaluation for my master´s thesis. the levene test was significant so i did the welch anova which was significant. now i tried the games-howell post hoc test but it didn´t work.
can anybody help me sending me the exact functions which i have to run in R to do the games-howell post hoc test and to get kind of a compact letter display, where it shows me which treatments are not significantly different from each other? i also wanted to ask if i did the welch anova the right way (you can find the output of R below)
here it the output which i did till now for the statistical evalutation:
data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
`Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 3.3491 0.01956 *
24
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1`
`> oneway.test (TSO2 ~Dauer, data=TSO2R, var.equal = FALSE) ###Welch-ANOVA
One-way analysis of means (not assuming equal variances)
data: TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
'''`
Thank you very much!

Using 2 worksheets to calculate in sql

I have data that goes back to 2004 to deal with so have to simplify calculations moving from using Excel to using SQL to save processing time & pressure on our servers.
My data is:
Period Employee EmOrg EmType Total Hours Mode
201306 GOVINP1 RSA/PZB/T00 S 180 66
201306 LANDCJ1 RSA/PZB/T00 S 200 35
201306 WOODRE RSA/PZB/T00 S 180 34
201306 MOKOHM1 RSA/JNB/T00 S 160 33
201306 KAPPPJ RSA/PLZ/T00 S 160 32
201306 CAHISJ RSA/PZB/T00 S 187 31
201306 ZEMUN RSA/PZB/T00 S 180 31
201306 SAULDD1 RSA/PZB/T00 S 190 28
201306 JEROP1 RSA/DUR/T00 S 188 26
201306 NGOBS1 RSA/PZB/T00 S 204 24
201306 ZONDNS2 RSA/PZB/T00 S 192 23
201306 DLAMMP RSA/PZB/T00 S 201 23
201306 MPHURK RSA/PLZ/T00 S 160 22
201306 MNDAMB RSA/PZB/T00 S 188 21
My desired outcome is:
Period EmOrg EmType TotalHours FTE S
201308 RSA/BFN/T00 S 198 1
201308 RSA/CPT/T00 S 744 3.757575
201308 RSA/DUR/T00 S 805 4.065656
201308 RSA/JNB/T00 S 396 2
201308 RSA/PLZ/T00 S 563 2.843434
201308 RSA/PTA/T00 S 594 3
201308 RSA/PZB/T00 S 4882 24.656565
And my query:
SELECT
LD.Period,
LD.EmOrg,
LD.EmType,
Sum(LD.RegHrs) AS 'Total Hours',
Sum(LD.RegHrs) / 198 As 'FTE_S'
FROM
SSI.dbo.LD LD
GROUP BY LD.Period , LD.EmOrg , LD.EmType
HAVING (LD.EmOrg Like '%T00')
AND (LD.EmType = 'S')
How do I refer to a column in a different worksheet to use as my Mode rather than dividing with an actual number? Because different months have a different mode and using an actual number will give wrong output in other months.
You need to create a separate table for your Mode per month and than use JOIN to get that value and use it.
Something like this:
SELECT
LD.Period,
LD.EmOrg,
LD.EmType,
Sum(LD.RegHrs) AS 'Total Hours',
Sum(LD.RegHrs) / M.Mode As 'FTE_S'
FROM
SSI.dbo.LD LD
INNER JOIN SSI.dbo.Mode M
ON LD.Period = M.Period -- Not sure its should be Period or Month
GROUP BY LD.Period , LD.EmOrg , LD.EmType
HAVING (LD.EmOrg Like '%T00')
AND (LD.EmType = 'S')