Only show significant correlations in a heatmap

I used the corr.test function to get the correlation values between two matrices, along with their adjusted p-values:
test <- corr.test(x, y, method="spearman", adjust="BH")
I have the correlations in test$r and the adjusted p-values in test$p.adj. The rownames and colnames of both matrices are the same.
I would like to create a heatmap that only shows the correlations that pass the p.adj cutoff of < 0.5. Any help is appreciated.
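One common approach (a minimal sketch, assuming the psych package's corr.test output described above and the cutoff from the question; pheatmap is an assumed plotting choice, and any heatmap function that accepts NAs would work similarly) is to set the non-significant cells to NA and render them in a neutral colour:
library(psych)
library(pheatmap)
test <- corr.test(x, y, method = "spearman", adjust = "BH")
r_masked <- test$r
r_masked[test$p.adj >= 0.5] <- NA   # blank out cells that fail the question's cutoff
pheatmap(r_masked, cluster_rows = FALSE, cluster_cols = FALSE, na_col = "grey90")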


Understanding log_prob for Normal distribution in pytorch

I'm currently trying to solve Pendulum-v0 from the OpenAI Gym environment, which has a continuous action space. As a result, I need to use a Normal distribution to sample my actions. What I don't understand is the dimension of log_prob when using it:
import torch
from torch.distributions import Normal

means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
dist = Normal(means, stds)
a = torch.tensor([1.2, 3.4])
d = dist.log_prob(a)
print(d.size())
I was expecting a tensor of size 2 (one log_prob for each action), but it outputs a tensor of size (2, 2).
However, when using a Categorical distribution for a discrete environment, the log_prob has the expected size:
from torch.distributions import Categorical

logits = torch.tensor([[-0.0657, -0.0949],
                       [-0.0586, -0.1007]])
dist = Categorical(logits=logits)
a = torch.tensor([1, 1])
print(dist.log_prob(a).size())
gives me a tensor of size (2).
Why is the log_prob for the Normal distribution a different size?
If one takes a look in the source code of torch.distributions.Normal and finds the definition of the log_prob(value) function, one can see that the main part of the calculation is:
return -((value - self.loc) ** 2) / (2 * var) - some other part
where value is a variable containing the values for which you want to calculate the log probability (in your case, a), self.loc is the mean of the distribution (in your case, means), and var is the variance, that is, the square of the standard deviation (in your case, stds**2). One can see that this is indeed the logarithm of the probability density function of the normal distribution, minus some constants and the logarithm of the standard deviation, which I don't write above.
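Spelled out, a sketch of the full computation (normal_log_prob is a hypothetical helper, equivalent to the library's formula up to broadcasting details):
import math
import torch

def normal_log_prob(value, loc, scale):
    # log p(value) = -(value - loc)**2 / (2*scale**2) - log(scale) - 0.5*log(2*pi)
    var = scale ** 2
    return -((value - loc) ** 2) / (2 * var) - torch.log(scale) - 0.5 * math.log(2 * math.pi)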
In the first example, you define means and stds as column vectors, while the values form a row vector:
means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
a = torch.tensor([1.2, 3.4])
But subtracting a row vector from a column vector, which is what the code does in value - self.loc, broadcasts to a matrix in Python (try it!). Thus the result you obtain is a log_prob value for each of your two defined distributions and for each of the values in a.
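You can verify the broadcasting directly; a quick check (the shapes are what matter here):
import torch

means = torch.tensor([[0.0538], [0.0651]])  # shape (2, 1): column vector
a = torch.tensor([1.2, 3.4])                # shape (2,): row-like vector
print((a - means).shape)                    # torch.Size([2, 2]): broadcast to a matrix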
If you want to obtain a log_prob without the cross terms, then define the variables consistently, i.e., either
means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
a = torch.tensor([[1.2],
                  [3.4]])
or
means = torch.tensor([0.0538, 0.0651])
stds = torch.tensor([0.7865, 0.7792])
a = torch.tensor([1.2, 3.4])
This is what you do in your second example, which is why you obtain the result you expected.
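Putting the second variant together as a self-contained check:
import torch
from torch.distributions import Normal

means = torch.tensor([0.0538, 0.0651])
stds = torch.tensor([0.7865, 0.7792])
a = torch.tensor([1.2, 3.4])

dist = Normal(means, stds)
print(dist.log_prob(a).size())  # torch.Size([2]): one log_prob per action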

Fastest way to create mapped matrix in numpy

I have the following problem. I have a 2D numpy matrix of integers; let's say they're in the range 0-19. The matrix has shape (r, c). Call this matrix M. I have an additional array, a lookup table, A, of length 19 whose elements are small numpy vectors, each with length N. What I want to do is create a new matrix of shape (r, c, N) where I've replaced each integer in the original matrix by its lookup-table counterpart.
Simple enough, it's just a function on a matrix which produces a new matrix with an additional dimension. I have written code for this which looks like:
import numpy as np

num_rows, num_cols = M.shape
result = np.zeros((num_rows, num_cols, N))
for col in range(num_cols):
    for row in range(num_rows):
        idx = M[row, col]
        result[row, col] = np.array(A[idx])
This works, but the problem is that it's too slow. If M has 500,000 elements then it takes around 600 ms per matrix, which is the bottleneck in my code. There has to be a clever numpy way of handling this that is faster, but I can't think of it.
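One common vectorized approach, sketched here under the assumption that A can be stacked into a single 2-D array (the sizes below are illustrative; note that values 0-19 need 20 table entries), is to index the lookup table with M directly:
import numpy as np

r, c, N = 1000, 500, 3
M = np.random.randint(0, 20, size=(r, c))    # integer matrix, values 0-19
A = np.random.rand(20, N)                    # lookup table stacked as one (20, N) array

result = A[M]                                # fancy indexing: shape (r, c, N)
Indexing with the whole matrix at once keeps the loop in compiled code rather than Python, which is typically orders of magnitude faster than the double loop.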

Stata: saving regression coefficients and standard errors in a .dta file when there are factor variables

I would like to run several regressions and store their results in a DTA file that I could later use for analysis. My constraints are:
- I cannot install modules (I am writing code for other people and I am not sure what modules they have installed).
- Some of the regressors are factor variables.
- Each regression differs only by the dependent variable, so I would like to store that in the final dataset to keep track of which regression the coefficients/variances correspond to.
I am seriously losing sanity here. I feel it's probably simple given that Stata is statistics software, but svmat is really not cooperative. Currently what I am doing is this:
sysuse census, clear
generate constant = 1
capture matrix drop regsresults // erase previously existing matrix
foreach depvar in marriage divorce {
    reg `depvar' popurban i.region constant, robust noconstant // regressions
    matrix result_matrix = e(b)\vecdiag(e(V)) // grab coeffs and their variances in a 2xK matrix
    matrix rownames result_matrix = `depvar'_b `depvar'_v // add row names to the two extra rows
    matrix regsresults = nullmat(regsresults)\result_matrix // append these results to the existing ones
}
matrix list regsresults
clear
svmat regsresults, names(col)
For each regression this creates one row that stores the coefficients and one row that stores their variances using vecdiag(e(V)). The row names for those two rows are the dependent variable name followed by _b for coefficients and _v for variances.
I use a manual constant because _cons is not a valid name for a variable when using svmat.
Of course my "solution" does not work, because factor levels generate strange matrix column names which are then invalid variable names when calling svmat. (The error is a terse invalid syntax.) I'd be happy with ANY solution to overcome that problem, given my constraints. It doesn't have to use svmat; coefficients and variances can be on the same line if it makes things easier, etc.
Renaming matrix columns is one option:
sysuse census, clear
generate constant = 1
capture matrix drop regsresults // erase previously existing matrix
foreach depvar in marriage divorce {
    reg `depvar' popurban i.region constant, robust noconstant // regressions
    matrix result_matrix = e(b)\vecdiag(e(V)) // grab coeffs and their variances in a 2xK matrix
    matrix rownames result_matrix = `depvar'_b `depvar'_v // add row names to the two extra rows
    matrix regsresults = nullmat(regsresults)\result_matrix // append these results to the existing ones
}
matrix list regsresults
matname regsresults reg1 reg2 reg3 reg4, columns(2..5) explicit
clear
svmat regsresults, names(col)
For more complex namelists (reg1 - reg4), you can build the syntax beforehand, store it in a local, and then use it with matname.
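For example, a minimal sketch (the reg1 ... reg4 names and the column range are illustrative):
local namelist
forvalues i = 1/4 {
    local namelist `namelist' reg`i'
}
matname regsresults `namelist', columns(2..5) explicit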
Edit
The same strategy, with some automation. It uses macro extended functions for matrices. See help extended_fcn.
sysuse census, clear
generate constant = 1
capture matrix drop regsresults // erase previously existing matrix
foreach depvar in marriage divorce {
    reg `depvar' popurban i.region constant, robust noconstant // regressions
    matrix result_matrix = e(b)\vecdiag(e(V)) // grab coeffs and their variances in a 2xK matrix
    matrix rownames result_matrix = `depvar'_b `depvar'_v // add row names to the two extra rows
    matrix regsresults = nullmat(regsresults)\result_matrix // append these results to the existing ones
}
// list the matrix
matrix list regsresults
// get original column names of matrix
local names : colfullnames regsresults
// get original row names of matrix (and row count)
local rownames : rowfullnames regsresults
local c : word count `rownames'
// make original names legal variable names
local newnames
foreach name of local names {
    local newnames `newnames' `=strtoname("`name'")'
}
// rename columns of matrix
matrix colnames regsresults = `newnames'
// convert matrix to dataset
clear
svmat regsresults, names(col)
// add matrix row names to dataset
gen rownames = ""
forvalues i = 1/`c' {
    replace rownames = "`:word `i' of `rownames''" in `i'
}
// list
order rownames
list, noobs
See also ssc describe matnames.
Just for completeness, using Roberto's excellent solution, this is the final code:
sysuse census, clear
generate constant = 1
capture matrix drop regsresults // erase previously existing matrix
replace region = region + 15
foreach depvar in marriage divorce {
    reg `depvar' popurban i.region constant, robust noconstant // regressions
    matrix result_matrix = e(b)\vecdiag(e(V)) // grab coeffs and their variances in a 2xK matrix
    matrix rownames result_matrix = `depvar'_b `depvar'_v // add row names to the two extra rows
    matrix regsresults = nullmat(regsresults)\result_matrix // append these results to the existing ones
}
matrix list regsresults
local rownames : rownames regsresults // collect row names
local colnames : colfullnames regsresults // collect column names
local newnames // clean column names
foreach name of local colnames {
    local newnames `newnames' `=strtoname("`name'")'
}
matrix colnames regsresults = `newnames' // apply the cleaned column names
clear
svmat regsresults, names(col)
// add the row names as their own variable, rown
gen str rown = ""
order rown
local i = 1
foreach rowname in `rownames' {
    replace rown = "`rowname'" if _n == `i'
    local i = `i' + 1
}
br

How to determine the width of peaks and compute an FFT for every peak (and plot each in a separate graph)

I have acceleration data for the X-axis and a time vector for it. I have determined the peaks above a threshold, and now I need to find the FFT for every peak.
As a result I have this:
Peak Value 1 = 458, index 1988
Peak Value 2 = 456, index 1990
Peak Value 3 = 450, index 12081
....
Peak Value 9 = 432, index 12151
To find these peaks I used the peakfinder script.
The command [peakLoc, peakMag] = peakfinder(x0,...) gives me location and magnitude of peaks.
Also I have the Time (from time data vector) for each peak.
So I suppose that I should take every peak, find its width (or some data points around the peak), and compute the FFT. Am I right? Could you help me with that?
I'm working in Octave and I'm new here :)
Code:
load ("C:\\..patch..\\peakfinder.m");
d =dlmread("C:\\..patch..\\acc2.csv", ";");
T=d(:,1);
Ax=d(:,2);
[peakInd peakVal]=peakfinder(Ax,10,430,1);
peakTime=T(peakInd);
[sortVal sortInd] = sort(peakVal, 'descend');
originInd = peakInd(sortInd);
for k = 1 : length(sortVal)
fprintf(1, 'Peak #%d = %d, index%d\n', k, sortVal(k), originInd (k));
end
plot(T,Ax,'b-',T(peakInd),Ax(peakInd),'rv');
and here you can download the data http://www.filedropper.com/acc2
FFT
d = dlmread("C:\\..path..\\acc2.csv", ";");
T = d(:,1);
Ax = d(:,2);
% sampling frequency
Fs_a = 2000;
% length of FFT
Length_Ax = numel(Ax);
% number of lines of Fourier spectrum
fft_L = Fs_a*2;
% an array of time samples
T_Ax = 0:1/Fs_a:Length_Ax;
fft_Ax = abs(fft(Ax, fft_L));
fft_Ax = 2*fft_Ax./fft_L;
F = 0:Fs_a/fft_L:Fs_a/2 - 1/fft_L;
subplot(3,1,1);
plot(T, Ax);
title('Ax axis');
xlabel('time (s)');
ylabel('amplitude'); grid on;
subplot(3,1,2);
plot(F, fft_Ax(1:length(F)));
title('spectrum max Ax axis');
xlabel('frequency (Hz)');
ylabel('amplitude'); grid on;
It looks like you have two clusters of peaks, so I would plot the data over three plots: one of the whole time series, one zoomed in on the first cluster, and the last one zoomed in on the second cluster (note I have divided all your time values by 1e6, otherwise the tick labels get ugly):
figure
subplot(3,1,1)
plot(T/1e6,Ax,'b-',peakTime/1e6,peakVal,'rv');
subplot(3,1,2)
plot(T/1e6,Ax,'b-',peakTime(1:4)/1e6,peakVal(1:4),'rv');
axis([0.99*peakTime(1)/1e6 1.01*peakTime(4)/1e6 0.9*peakVal(1) 1.1*peakVal(4)])
subplot(3,1,3)
plot(T/1e6,Ax,'b-',peakTime(5:end)/1e6,peakVal(5:end),'rv');
axis([0.995*peakTime(5)/1e6 1.005*peakTime(end)/1e6 0.9*peakVal(5) 1.1*peakVal(end)])
I have set the axes around the extreme time and acceleration values, using some coefficients to add some "padding" (the values of these coefficients were obtained through trial and error). This gives me the following plot; hopefully this is the sort of thing you are after. You can add x and y labels if you wish.
EDIT
Here's how I would do the FFT:
Fs = 2000;
L = length(Ax);
NFFT = 2^nextpow2(L); % Next power of 2 from length of Ax
Ax_FFT = fft(Ax,NFFT)/L;
f = Fs/2*linspace(0,1,NFFT/2+1);
% Plot single-sided amplitude spectrum.
figure
semilogx(f,2*abs(Ax_FFT(1:NFFT/2+1))) % using semilogx as huge DC component
title('Single-Sided Amplitude Spectrum of Ax')
xlabel('Frequency (Hz)')
ylabel('|Ax(f)|')
ylim([0 300])
giving the following result:
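As for the per-peak part of the question, a minimal sketch of windowing each detected peak and transforming the window separately (the window half-length halfwin is an assumption you would tune to the actual peak width):
Fs = 2000;       % sampling frequency
halfwin = 256;   % samples on either side of each peak (illustrative)
for k = 1:length(peakInd)
    lo = max(1, peakInd(k) - halfwin);
    hi = min(length(Ax), peakInd(k) + halfwin);
    seg = Ax(lo:hi);                    % window around peak k
    NFFT = 2^nextpow2(length(seg));
    S = fft(seg, NFFT) / length(seg);
    f = Fs/2 * linspace(0, 1, NFFT/2 + 1);
    figure(k);                          % one figure per peak
    plot(f, 2*abs(S(1:NFFT/2+1)));
    title(sprintf('Spectrum around peak #%d', k));
    xlabel('Frequency (Hz)'); ylabel('Amplitude');
end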

Returning a function in a list, from a function

I searched for this question, but found answers that weren't specific enough.
I'm cleaning up old code and I'm trying to make sure that the following is relatively clean, and hoping that it won't bite me on the rear later on.
My question is about passing a function through a function. Look at the "y" part of the following plot statement. The goo(df)[[1]](x) thing works, but am I asking for trouble in any way? If so, is there a cleaner way?
Also, if the goo() function is called many many times, for instance in a Monte Carlo analysis, will this load up R's internals or possibly cause some type of environment issues?
Edit (02/21/2011) --- The following code is just an example. The real function "goo" has a lot of code before it gets to the approxfun() statement.
#Build a dataframe
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
#Build a function that passes a function
goo <- function(inp.df) {
  out.fun <- approxfun(x=inp.df$a, y=inp.df$b, yright=max(inp.df$b),
                       method="linear", f=1)
  list(out.fun, inp.df$a[5], inp.df$b[5])
}
#Set up the plot range
x <- seq(1, 4.3, 0.01)
#Plot the function
plot(x, goo(df)[[1]](x), type="l", xlim=c(0, goo(df)[[2]]), ylim=c(0, goo(df)[[3]]), lwd=2, col="red")
grid()
goo(df)
[[1]]
function (v)
.C("R_approxfun", as.double(x), as.double(y), as.integer(n),
xout = as.double(v), as.integer(length(v)), as.integer(method),
as.double(yleft), as.double(yright), as.double(f), NAOK = TRUE,
PACKAGE = "stats")$xout
<environment: 0219d56c>
[[2]]
[1] 5
[[3]]
[1] 6
It's hard to give you specific recommendations without knowing exactly what your code is, but here are a few things to consider:
Is it really necessary to include pieces of goo's input data in its return value? In other words, can you make goo a straightforward factory that just returns a function? In your example, at least, the plot function already has all the data it needs to determine the limits.
If this is not possible, then stay with this pattern, but give the elements of goo's return value descriptive names so that at least it's easy to see what's going on when you reference them. (E.g., goo(df)$approx(x).) If this structure is used widely in your code, consider making it an S3 class.
Finally, don't invoke goo(df) multiple times in the plot function, just to get different elements out. When you do that, you literally call goo every time, which as you said will execute a lot of code. Also, each invocation will have its own environment with a copy of the input data (although R will be smart enough to reduce the copying to a certain extent and use the same physical instance of df.) Instead, call goo once, assign its value to a variable, and reference that variable subsequently.
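A sketch combining both suggestions above (the descriptive list-element names are illustrative, and max() replaces the hard-coded fifth element):
goo <- function(inp.df) {
  out.fun <- approxfun(x = inp.df$a, y = inp.df$b, yright = max(inp.df$b),
                       method = "linear", f = 1)
  list(approx = out.fun, xmax = max(inp.df$a), ymax = max(inp.df$b))
}

res <- goo(df)          # call goo once and reuse the result
x <- seq(1, 4.3, 0.01)
plot(x, res$approx(x), type = "l",
     xlim = c(0, res$xmax), ylim = c(0, res$ymax), lwd = 2, col = "red")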
I would remove a level of function handling and keep the input data out of the function generation. Then you can keep your function outside of goo and call approxfun only once.
It also generalizes to an input dataframe of any size, not just one with 5 rows.
#Build a dataframe
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
#Build a function
fun <- approxfun(x = df$a, y = df$b, yright=max(df$b), method="linear", f = 1)
#Set up the plot range
x <- seq(1, 4.3, 0.01)
#Plot the function
plot(x, fun(x), type="l", xlim=c(0, max(df$a)), ylim=c(0, max(df$b)), lwd=2, col="red")
That might not be quite what you need ultimately, but it does remove a level of complexity and gives a cleaner starting point.
This might not be better in a big Monte Carlo simulation, but for simpler situations, it might be clearer to include the x and y ranges as attributes of the output from the created function instead of in a list with the created function. This way goo is a more straightforward factory, like Davor mentions. You could also make the result from your function an object (here using S3) so that it can be plotted more simply.
goo <- function(inp.df) {
  out.fun <- approxfun(x=inp.df$a, y=inp.df$b, yright=max(inp.df$b),
                       method="linear", f=1)
  xmax <- inp.df$a[5]
  ymax <- inp.df$b[5]
  function(...) {
    # note: x is not an argument here; R finds it in the enclosing (global) environment
    structure(data.frame(x=x, y=out.fun(...)),
              limits=list(x=xmax, y=ymax),
              class=c("goo", "data.frame"))
  }
}
plot.goo <- function(x, xlab="x", ylab="approx",
                     xlim=c(0, attr(x, "limits")$x),
                     ylim=c(0, attr(x, "limits")$y),
                     lwd=2, col="red", ...) {
  plot(x$x, x$y, type="l", xlab=xlab, ylab=ylab,
       xlim=xlim, ylim=ylim, lwd=lwd, col=col, ...)
}
Then to make the function for a data frame, you'd do:
df <- data.frame(a=c(1, 2, 3, 4, 5), b=c(4, 3, 1, 2, 6))
goodf <- goo(df)
And to use it on a vector, you'd do:
x <- seq(1, 4.3, 0.01)
goodfx <- goodf(x)
plot(goodfx)