Crosstables for survey data (weighted and unweighted) and regression

I am working with survey data and need to produce some tables and regression analyses. After attaching the data, this is the code I use to make a table of four variables:
ftable(var1, var2, var3, var4)
And this is the regression code that I use for the data:
logit.1 <- glm(var4 ~ var3 + var2 + var1, family = binomial(link = "logit"))
summary(logit.1)
So far so good for the unweighted analyses. But how can I do the same analyses for the weighted data? Here is some additional info:
There are four variables in the dataset that reflect the sampling structure. These are
strat: stratum (urban or (sub-county) rural)
clust: batch of interviews that were part of the same random walk
vill_neigh_code: village or neighbourhood code
sweight: sampling weights

library(survey)
data(api)
# example data set
head( apiclus2 )
# instead of var1 - var4, use these four variables:
ftable( apiclus2[ , c( 'sch.wide' , 'comp.imp' , 'both' , 'awards' ) ] )
# move it over to x for faster typing
x <- apiclus2
# also give x a column of all ones
x$one <- 1
# run the glm() function specified.
logit.1 <-
    glm(
        comp.imp ~ target + cnum + growth ,
        data = x ,
        family = binomial( link = 'logit' )
    )
summary( logit.1 )
# now create the survey object you've described
dclus <-
    svydesign(
        id = ~dnum + snum ,   # cluster variable(s)
        strata = ~stype ,     # stratum variable
        weights = ~pw ,       # weight variable
        data = x ,
        nest = TRUE
    )
# weighted counts
svyby(
    ~one ,
    ~ sch.wide + comp.imp + both + awards ,
    dclus ,
    svytotal
)
# weighted counts formatted differently
ftable(
    svyby(
        ~one ,
        ~ sch.wide + comp.imp + both + awards ,
        dclus ,
        svytotal ,
        keep.var = FALSE
    )
)
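# an alternative, not in the original answer: svytable() computes the weighted
# crosstab in a single call, and its result can be wrapped in ftable() the same way
ftable( svytable( ~ sch.wide + comp.imp + both + awards , dclus ) )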
# run the svyglm() function specified.
logit.2 <-
    svyglm(
        comp.imp ~ target + cnum + growth ,
        design = dclus ,
        family = binomial( link = 'logit' )
    )
summary( logit.2 )
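Mapping the example back to the design variables described in the question, the weighted analyses would look roughly like this. This is only a sketch: mydata stands for whatever the survey data frame is called, and whether vill_neigh_code should enter as an additional stage of clustering depends on how the sample was actually drawn.

dsurvey <-
    svydesign(
        id = ~clust ,        # random-walk batches as clusters
        strata = ~strat ,    # urban / rural strata
        weights = ~sweight , # sampling weights
        data = mydata ,
        nest = TRUE
    )

logit.w <-
    svyglm(
        var4 ~ var3 + var2 + var1 ,
        design = dsurvey ,
        family = binomial( link = 'logit' )
    )
summary( logit.w )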

Related

R - specifying interaction contrasts for aov

How to specify the contrasts (point estimates, 95% CIs and p-values) for the between-group differences of the within-group delta changes?
In the example below, I am interested in the between-group difference (group = 1 minus group = 2) of the delta changes (time = 3 minus time = 1).
df and model:
demo3 <- read.csv("https://stats.idre.ucla.edu/stat/data/demo3.csv")
## Convert variables to factor
demo3 <- within(demo3, {
    group <- factor(group)
    time <- factor(time)
    id <- factor(id)
})
par(cex = .6)
demo3$time <- as.factor(demo3$time)
demo3.aov <- aov(pulse ~ group * time + Error(id), data = demo3)
summary(demo3.aov)
Neither of these chunks of code achieves my goal, correct?
m2 <- emmeans(demo3.aov, "group", by = "time")
pairs(m2)
m22 <- emmeans(demo3.aov, c("group", "time") )
pairs(m22)
Look at the documentation for emmeans::contrast and in particular the argument interaction. If I understand your question correctly, you might want
summary(contrast(m22, interaction = c("pairwise", "dunnett")),
infer = c(TRUE, TRUE))
which would compute Dunnett-style contrasts for time (each time vs. time1) and then compare those contrasts for group1 - group2. The summary(..., infer = c(TRUE, TRUE)) part overrides the default of showing tests but not confidence intervals.
You could also do this in stages:
time.con <- contrast(m22, "dunnett", by = "group", name = "timediff")
summary(pairs(time.con, by = NULL), infer = c(TRUE, TRUE))
If you truly want just time 3 - time 1, then replace time.con with
time.con1 <- contrast(m22, list(`time3-time1` = c(-1, 0, 1, 0, 0)))
(I don't know how many time points you have; I assumed 5 in the above.)
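For completeness, a sketch of how that replacement slots into the two-stage code above (still assuming 5 time points, and adding by = "group" so the coefficients again apply to the time levels within each group):

time.con1 <- contrast(m22, list(`time3-time1` = c(-1, 0, 1, 0, 0)), by = "group")
summary(pairs(time.con1, by = NULL), infer = c(TRUE, TRUE))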

How to display time of day on a ggplot axis after using SQL UNIX_TIMESTAMP()?

I am working with data returned by a query similar to this:
SELECT UNIX_TIMESTAMP(timestamp) DIV 300 AS period, COUNT(*) as count from tbl
GROUP BY UNIX_TIMESTAMP(timestamp) DIV 300
which groups the counts into 5-minute intervals. The result is then imported into R and looks like this:
set.seed(1)
mydata <- data.frame(period = seq(5391360, 5391647), count = rpois(288, 4))
head(mydata)
## period count
## 1 5391360 3
## 2 5391361 3
## 3 5391362 4
## 4 5391363 7
## 5 5391364 2
## 6 5391365 7
I would now like to plot this with ggplot, where the x axis shows the actual time of day in hourly intervals: 01:00, 02:00, 03:00, etc. I have been doing this by piping the data into:
ggplot(aes(y = count, x = period)) + geom_bar(stat = "identity") +
    ggtitle("5 min counts") +
    theme(plot.title = element_text(lineheight = .8, face = "bold", hjust = 0.5),
          axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
which produces the bar chart, but with the raw period values on the x axis rather than the hourly labels (01:00, 02:00, etc.) that I am after.
In this solution, I first create a vector of datetime values: the vector mydata$period is multiplied by 300 and coerced to class "POSIXct", and then only the hours and minutes are kept.
period <- as.POSIXct(mydata$period * 300, origin = "1970-01-01")
period <- format(period, "%H:%M")
library(ggplot2)
ggplot(data = data.frame(period, count = mydata$count),
       mapping = aes(period, count)) +
    geom_col(position = position_dodge())
To get a plot by hour, use format to keep the hours only instead of hours and minutes, and then aggregate the counts by hour.
set.seed(1)
mydata <- data.frame(period = seq(5391360, 5391647), count = rpois(288, 4))
mydata$hour <- as.POSIXct(mydata$period*300, origin = "1970-01-01")
mydata$hour <- format(mydata$hour, "%H")
agg <- aggregate(count ~ hour, mydata, sum)
library(ggplot2)
ggplot(data = agg, aes(hour, count)) +
    geom_col(position = position_dodge())
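An alternative sketch, not part of the original answer: keep the converted values as POSIXct instead of formatting them to character, and let scale_x_datetime() place the hourly labels (tz = "UTC" is assumed here so the labels line up with the raw UNIX timestamps):

library(ggplot2)
mydata$time <- as.POSIXct(mydata$period * 300, origin = "1970-01-01", tz = "UTC")
ggplot(mydata, aes(time, count)) +
    geom_col() +
    scale_x_datetime(date_breaks = "1 hour", date_labels = "%H:%M")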

How to use DataParallel when there is a ‘for’ loop in the network?

I have a server with two GPUs; with one GPU I need more than 10 days to finish 1000 epochs. But when I tried to use DataParallel, the program didn't work, seemingly because there is a ‘for’ loop in my network. How can I use DataParallel in this case, or is there any other way to speed up training?
class WaveNet( nn.Module ):
    def __init__(self, mu, n_residue, n_skip, dilation_depth, n_repeat):
        # mu: audio quantization size
        # n_residue: residue channels
        # n_skip: skip channels
        # dilation_depth & n_repeat: dilation layer setup
        self.mu = mu
        super( WaveNet, self ).__init__()
        self.dilation_depth = dilation_depth
        dilations = self.dilations = [2 ** i for i in range( dilation_depth )] * n_repeat
        self.one_hot = One_Hot( mu )
        self.from_input = nn.Conv1d( in_channels=mu, out_channels=n_residue, kernel_size=1 )
        self.from_input = nn.DataParallel(self.from_input)
        self.conv_sigmoid = nn.ModuleList(
            [nn.Conv1d( in_channels=n_residue, out_channels=n_residue, kernel_size=2, dilation=d )
             for d in dilations] )
        self.conv_sigmoid = nn.DataParallel(self.conv_sigmoid)
        self.conv_tanh = nn.ModuleList(
            [nn.Conv1d( in_channels=n_residue, out_channels=n_residue, kernel_size=2, dilation=d )
             for d in dilations] )
        self.conv_tanh = nn.DataParallel(self.conv_tanh)
        self.skip_scale = nn.ModuleList(
            [nn.Conv1d( in_channels=n_residue, out_channels=n_skip, kernel_size=1 )
             for d in dilations] )
        self.skip_scale = nn.DataParallel(self.skip_scale)
        self.residue_scale = nn.ModuleList(
            [nn.Conv1d( in_channels=n_residue, out_channels=n_residue, kernel_size=1 )
             for d in dilations] )
        self.residue_scale = nn.DataParallel(self.residue_scale)
        self.conv_post_1 = nn.Conv1d( in_channels=n_skip, out_channels=n_skip, kernel_size=1 )
        self.conv_post_1 = nn.DataParallel(self.conv_post_1)
        self.conv_post_2 = nn.Conv1d( in_channels=n_skip, out_channels=mu, kernel_size=1 )
        self.conv_post_2 = nn.DataParallel(self.conv_post_2)

    def forward(self, input, train=True):
        output = self.preprocess( input, train )
        skip_connections = [] # save for generation purposes
        for s, t, skip_scale, residue_scale in zip( self.conv_sigmoid, self.conv_tanh, self.skip_scale,
                                                    self.residue_scale ):
            output, skip = self.residue_forward( output, s, t, skip_scale, residue_scale )
            skip_connections.append( skip )
        # sum up skip connections
        output = sum( [s[:, :, -output.size( 2 ):] for s in skip_connections] )
        output = self.postprocess( output, train )
        return output
This is the error I get:
TypeError: zip argument #1 must support iteration
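The usual workaround, offered here only as a sketch (it is not from the original thread, and the constructor arguments and the name batch are example values): keep the ModuleLists unwrapped inside the network and wrap the whole model in nn.DataParallel once. A DataParallel wrapper is not iterable, which is what the zip() call in forward() trips over, whereas each replica of the full model still iterates over plain ModuleLists.

import torch.nn as nn

# drop the per-layer nn.DataParallel(...) calls inside WaveNet.__init__, then:
model = WaveNet(mu=256, n_residue=24, n_skip=128, dilation_depth=10, n_repeat=2)
model = nn.DataParallel(model).cuda()   # replicate the whole network on both GPUs
output = model(batch)                   # 'batch' is the input tensor; it is split along dim 0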

Assigned Octave variable not being saved to file

In the Octave script below I am looping through files in a directory, loading them into Octave to do some manipulation of the data, and then attempting to write the manipulated data (a matrix) to a new file whose name is derived from the name of the input file. The manipulated data is assigned to a variable with the same name as the file it is to be saved in. All unwanted variables are cleared, and the save command should then write the single remaining variable (the matrix) to the file "new_filename."
However, this last save/write is not doing what I expect, and I don't understand why. Without specific variable arguments, the save function should save all variables in scope, and in this case there is only the one matrix left to save. Why is this not working?
clear all ;
all_raw_OHLC_files = glob( "*_raw_OHLC_daily" ) ; % cell with filenames matching *_raw_OHLC_daily
for ii = 1 : length( all_raw_OHLC_files ) % loop for length of above cell
    filename = all_raw_OHLC_files{ii} ; % get file's name
    % create a new filename for the output file
    split_filename = strsplit( filename , "_" ) ;
    new_filename = tolower( [ split_filename{1} "_" split_filename{2} "_ohlc_daily" ] ) ;
    % open and read file
    fid = fopen( filename , 'rt' ) ;
    data = textscan( fid , '%s %f %f %f %f %f %s' , 'Delimiter' , ',' , 'CollectOutput', 1 ) ;
    fclose( fid ) ;
    ex_data = [ datenum( data{1} , 'yyyy-mm-dd HH:MM:SS' ) data{2} ] ; % extract the file's data
    % process the raw data into OHLC bars
    weekday_ix = weekday( ex_data( : , 1 ) ) ;
    % find Mondays immediately preceded by Sundays in the data
    monday_ix = find( ( weekday_ix == 2 ) .* ( shift( weekday_ix , 1 ) == 1 ) ) ;
    sunday_ix = monday_ix .- 1 ;
    % replace Monday open with the Sunday open
    ex_data( monday_ix , 2 ) = ex_data( sunday_ix , 2 ) ;
    % replace Monday high with max of Sunday high and Monday high
    ex_data( monday_ix , 3 ) = max( ex_data( sunday_ix , 3 ) , ex_data( monday_ix , 3 ) ) ;
    % repeat for min of lows
    ex_data( monday_ix , 4 ) = min( ex_data( sunday_ix , 4 ) , ex_data( monday_ix , 4 ) ) ;
    % combine volume figures
    ex_data( monday_ix , 6 ) = ex_data( sunday_ix , 6 ) .+ ex_data( monday_ix , 6 ) ;
    % now delete the Sunday data
    ex_data( sunday_ix , : ) = [] ;
    assignin( "base" , tolower( [ split_filename{1} "_" split_filename{2} "_ohlc_daily" ] ) , ex_data )
    clear ans weekday_ix sunday_ix monday_ix ii filename split_filename fid ex_data data all_raw_OHLC_files
    % print to file
    save new_filename
endfor
save new_filename saves the current workspace to a file literally named "new_filename". I guess what you want is to create a file whose name is stored in the variable new_filename:
save (new_filename);
Your current approach of "clear everything I don't need and then store the whole workspace" is IMHO very ugly; you should instead explicitly save ex_data if this is the only variable you want:
save (new_filename, "ex_data");
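For illustration, the difference between the three forms:

save new_filename                % writes all workspace variables to a file literally called "new_filename"
save (new_filename)              % writes all workspace variables to the file named by the string in new_filename
save (new_filename, "ex_data")   % same target file, but writes only the ex_data matrix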

Using a metric predictor when modelling ordinal predicted variable in PyMC3

I am trying to implement the ordered probit regression model from chapter 23.4 of Doing Bayesian Data Analysis (Kruschke) in PyMC3. After sampling, the posterior distributions for the intercept and slope are not really comparable to the results from the book. I think there is some fundamental issue with the model definition, but I fail to see it.
Data:
X is the metric predictor (standardized to zX), Y are ordinal outcomes (1-7).
nYlevels3 = df3.Y.nunique()
# Setting the thresholds for the ordinal outcomes. The outer sides are
# fixed, while the others are estimated.
thresh3 = [k + .5 for k in range(1, nYlevels3)]
thresh_obs3 = np.ma.asarray(thresh3)
thresh_obs3[1:-1] = np.ma.masked
@as_op(itypes=[tt.dvector, tt.dvector, tt.dscalar], otypes=[tt.dmatrix])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty((mu.size, nYlevels3))
    n = norm(loc=mu, scale=sigma)
    out[:,0] = n.cdf(theta[0])
    out[:,1] = np.max([np.repeat(0,mu.size), n.cdf(theta[1]) - n.cdf(theta[0])])
    out[:,2] = np.max([np.repeat(0,mu.size), n.cdf(theta[2]) - n.cdf(theta[1])])
    out[:,3] = np.max([np.repeat(0,mu.size), n.cdf(theta[3]) - n.cdf(theta[2])])
    out[:,4] = np.max([np.repeat(0,mu.size), n.cdf(theta[4]) - n.cdf(theta[3])])
    out[:,5] = np.max([np.repeat(0,mu.size), n.cdf(theta[5]) - n.cdf(theta[4])])
    out[:,6] = 1 - n.cdf(theta[5])
    return out
with pm.Model() as ordinal_model_metric:
    theta = pm.Normal('theta', mu=thresh3, tau=np.repeat(1/2**2, len(thresh3)),
                      shape=len(thresh3), observed=thresh_obs3, testval=thresh3[1:-1])
    # Intercept
    zbeta0 = pm.Normal('zbeta0', mu=(1+nYlevels3)/2, tau=1/nYlevels3**2)
    # Slope
    zbeta = pm.Normal('zbeta', mu=0.0, tau=1/nYlevels3**2)
    # Mean of the underlying metric distribution
    mu = pm.Deterministic('mu', zbeta0 + zbeta*zX)
    zsigma = pm.Uniform('zsigma', nYlevels3/1000.0, nYlevels3*10.0)
    pr = outcome_probabilities(theta, mu, zsigma)
    y = pm.Categorical('y', pr, observed=df3.Y.cat.codes)
The complete notebook is here: http://nbviewer.jupyter.org/github/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2023.ipynb
For reference, here is the JAGS model used by Kruschke on which I based my model:
Ntotal = length(y)
# Threshold 1 and nYlevels-1 are fixed; other thresholds are estimated.
# This allows all parameters to be interpretable on the response scale.
nYlevels = max(y)
thresh = rep(NA,nYlevels-1)
thresh[1] = 1 + 0.5
thresh[nYlevels-1] = nYlevels-1 + 0.5
modelString = "
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dcat( pr[i,1:nYlevels] )
    pr[i,1] <- pnorm( thresh[1] , mu[x[i]] , 1/sigma[x[i]]^2 )
    for ( k in 2:(nYlevels-1) ) {
      pr[i,k] <- max( 0 , pnorm( thresh[ k ] , mu[x[i]] , 1/sigma[x[i]]^2 )
                        - pnorm( thresh[k-1] , mu[x[i]] , 1/sigma[x[i]]^2 ) )
    }
    pr[i,nYlevels] <- 1 - pnorm( thresh[nYlevels-1] , mu[x[i]] , 1/sigma[x[i]]^2 )
  }
  for ( j in 1:2 ) { # 2 groups
    mu[j] ~ dnorm( (1+nYlevels)/2 , 1/(nYlevels)^2 )
    sigma[j] ~ dunif( nYlevels/1000 , nYlevels*10 )
  }
  for ( k in 2:(nYlevels-2) ) { # 1 and nYlevels-1 are fixed, not stochastic
    thresh[k] ~ dnorm( k+0.5 , 1/2^2 )
  }
}
"
It was not a fundamental issue after all: I had forgotten to specify the axis for np.max() in the function below.
@as_op(itypes=[tt.dvector, tt.dvector, tt.dscalar], otypes=[tt.dmatrix])
def outcome_probabilities(theta, mu, sigma):
    out = np.empty((mu.size, nYlevels3))
    n = norm(loc=mu, scale=sigma)
    out[:,0] = n.cdf(theta[0])
    out[:,1] = np.max([np.repeat(0,mu.size), n.cdf(theta[1]) - n.cdf(theta[0])], axis=0)
    out[:,2] = np.max([np.repeat(0,mu.size), n.cdf(theta[2]) - n.cdf(theta[1])], axis=0)
    out[:,3] = np.max([np.repeat(0,mu.size), n.cdf(theta[3]) - n.cdf(theta[2])], axis=0)
    out[:,4] = np.max([np.repeat(0,mu.size), n.cdf(theta[4]) - n.cdf(theta[3])], axis=0)
    out[:,5] = np.max([np.repeat(0,mu.size), n.cdf(theta[5]) - n.cdf(theta[4])], axis=0)
    out[:,6] = 1 - n.cdf(theta[5])
    return out
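A small illustration, not from the original post, of why the axis argument matters here: without it, np.max() collapses the stacked arrays to a single scalar instead of taking the element-wise maximum that clips negative probabilities at zero.

import numpy as np

zeros = np.repeat(0., 3)
diffs = np.array([0.2, -0.1, 0.5])

np.max([zeros, diffs])          # 0.5 -- a single scalar for the whole column
np.max([zeros, diffs], axis=0)  # array([0.2, 0. , 0.5]) -- element-wise clipping at zero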