Vectorization of feature scaling - octave

I want to feature scale a matrix (X) with 2 columns. I am using mean normalization, and I wrote the following lines in Octave:
X_norm = X
mu = mean(X);
sigma = std(X);
X_norm(:,1) = (X_norm(:,1) .- mu(:,1)) ./ sigma(:,1);
X_norm(:,2) = (X_norm(:,2) .- mu(:,2)) ./ sigma(:,2);
Can you please let me know a cleaner way to vectorize these calculations?
I checked my code by comparing it with the result from zscore(X), and they matched: sum(X_norm - zscore(X)) returned 0 0.
I am constrained to not use zscore(), and hence the question.
Sample data as follows:
2104 3
1600 3
2400 3
1416 2
3000 4
1985 4
1534 3
1427 3
1380 3
1494 3
1940 4
2000 3
1890 3
4478 5
1268 3
2300 4
1320 2
1236 3
2609 4
3031 4
1767 3
1888 2
1604 3
1962 4
3890 3
1100 3
1458 3
2526 3
2200 3
2637 3

You could simply do:
X_norm = (X - mean(X,1)) ./ std(X,0,1);

I ran into a division-by-zero issue during cross-validation (a feature with zero standard deviation).
This worked for me:
mu = mean(X);
X_norm = X - mu;
sigma = std(X);
% Avoid division by zero: replace zero sigmas with 1
sigmaZeroIdx = sigma == 0;
sigma(1,sigmaZeroIdx) = 1;
X_norm = X_norm ./ sigma;
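For readers who want to check the zero-sigma guard outside Octave, here is a minimal sketch of the same idea in plain Python. The function name scale_columns and the tiny sample matrix are illustrative assumptions, not part of the answer above; statistics.stdev uses the n-1 normalization, which matches Octave's default std.

```python
from statistics import mean, stdev

def scale_columns(X):
    """Column-wise mean normalization; constant columns are only centered."""
    columns = list(zip(*X))                      # transpose rows -> columns
    mu = [mean(col) for col in columns]
    sigma = [stdev(col) for col in columns]      # sample std (n - 1), like Octave's std
    sigma = [s if s != 0 else 1 for s in sigma]  # guard: replace zero sigma with 1
    X_norm = [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in X]
    return X_norm, mu, sigma

# Hypothetical data: the second column is constant, so its std is zero.
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
X_norm, mu, sigma = scale_columns(X)
```

With the guard in place, the constant column simply becomes all zeros instead of producing a division by zero.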

I think you could use a for loop over the N features:
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
for iter = 1:size(X, 2)
  mu(1,iter) = mean(X_norm(:,iter));
  X_norm(:,iter) = X_norm(:,iter) - mu(1,iter);
  sigma(1,iter) = std(X_norm(:,iter));
  X_norm(:,iter) = X_norm(:,iter) ./ sigma(1,iter);
end

Related

How to deduce left-hand side matrix from vector?

Suppose I have the following script, which constructs a symbolic array, A_known, and a symbolic vector x, and performs a matrix multiplication.
clc; clearvars
try
pkg load symbolic
catch
error('Symbolic package not available!');
end
syms V_l k s0 s_mean
N = 3;
% Generate left-hand-side square matrix
A_known = sym(zeros(N));
for hI = 1:N
A_known(hI, 1:hI) = exp(-(hI:-1:1)*k);
end
A_known = A_known./V_l;
% Generate x vector
x = sym('x', [N 1]);
x(1) = x(1) + s0*V_l;
% Matrix multiplication to give b vector
b = A_known*x
Suppose A_known was actually unknown. Is there a way to deduce it from b and x? If so, how?
Until now, I have only had the case where x was unknown, which can normally be solved via x = A \ b.
Mathematically, it is possible to get a solution, but there are infinitely many solutions.
Example:
A = magic(5);
x = (1:5)';
b = A*x;
A_sol = b*pinv(x);
Here the original matrix is
>> A
A =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
but the recovered A_sol is a different matrix that also satisfies A_sol*x = b:
>> A_sol
A_sol =
3.1818 6.3636 9.5455 12.7273 15.9091
3.4545 6.9091 10.3636 13.8182 17.2727
4.4545 8.9091 13.3636 17.8182 22.2727
3.4545 6.9091 10.3636 13.8182 17.2727
3.1818 6.3636 9.5455 12.7273 15.9091
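The non-uniqueness can be checked by hand. For a nonzero column vector x, pinv(x) equals x' / (x'*x), so b*pinv(x) is a rank-one matrix. The following plain-Python sketch (the helper names rank_one_solution and matvec are illustrative assumptions) rebuilds A_sol from the example and confirms that it reproduces b even though it differs from A.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def rank_one_solution(b, x):
    """Return b * pinv(x), using pinv(x) = x' / (x' * x) for a nonzero column vector x."""
    xtx = sum(v * v for v in x)
    return [[bi * xj / xtx for xj in x] for bi in b]

# magic(5) as printed in the example above
A = [
    [17, 24, 1, 8, 15],
    [23, 5, 7, 14, 16],
    [4, 6, 13, 20, 22],
    [10, 12, 19, 21, 3],
    [11, 18, 25, 2, 9],
]
x = [1, 2, 3, 4, 5]
b = matvec(A, x)
A_sol = rank_one_solution(b, x)
b_check = matvec(A_sol, x)  # equals b up to rounding, although A_sol != A
```

Any matrix of the form A_sol + Z with Z*x = 0 also maps x to b, which is why a single (x, b) pair cannot pin down A.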

Why does an out-of-bound error occur in my for loop?

An out-of-bound error occurred.
The code is written in Octave:
for ii=1:1:10
m(ii)=ii*8
q=m(ii)
if (ii>=2)
q(ii).xdot=(q(ii).x-q(ii-1).x)/Ts;
end
end
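But the error says:
q(2): out of bound 1
How can I fix it?
The error occurs because q = m(ii) makes q a scalar, so q(2) does not exist. For this type of assignment you do not need a loop, and in any case you need to define Ts.
To calculate the increments between consecutive elements, you can use diff: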
x=(1:1:10)*8
x =
8 16 24 32 40 48 56 64 72 80
octave:5> Ts=2
Ts = 2
octave:6> xdot=diff(x)/Ts
xdot =
4 4 4 4 4 4 4 4 4
octave:7> size(x)
ans =
1 10
octave:8> size(xdot)
ans =
1 9
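The same finite-difference idea can be sketched in plain Python. The helper name diff below is a stand-in written for illustration, not a library call; it mirrors what Octave's diff does for a vector.

```python
def diff(values):
    """Differences between consecutive elements (one element shorter than the input)."""
    return [b - a for a, b in zip(values, values[1:])]

x = [8 * i for i in range(1, 11)]  # 8, 16, ..., 80
Ts = 2
xdot = [d / Ts for d in diff(x)]   # 9 values, one fewer than x
```

As in the Octave session, the result has one element fewer than x, because each difference needs a pair of neighbours.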

Julia: Detect and remove duplicate rows from array?

What is the best method of detecting and dropping duplicate rows from an array in Julia?
x = Integer.(round.(10 .* rand(1000,4)))
# In R I would apply the duplicated function.
x = x[duplicated(x),:]
unique is what you are looking for (although it does not cover the detection part of the question):
julia> x = Integer.(round.(10 .* rand(1000,4)))
1000×4 Array{Int64,2}:
7 3 10 1
7 4 8 9
7 7 3 0
3 4 8 2
⋮
julia> unique(x, 1)
973×4 Array{Int64,2}:
7 3 10 1
7 4 8 9
7 7 3 0
3 4 8 2
⋮
As for the detection part, a dirty fix would be editing this line:
@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))
to:
(@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))), uniquerows
Alternatively, you could define your own unique2 with the abovementioned changes:
using Base.Cartesian
import Base.Prehashed
@generated function unique2(A::AbstractArray{T,N}, dim::Int) where {T,N}
......
end
julia> y, idx = unique2(x, 1)
julia> y
960×4 Array{Int64,2}:
8 3 1 5
8 3 1 6
1 1 0 1
8 10 1 10
9 1 8 7
⋮
julia> setdiff(1:1000, idx)
40-element Array{Int64,1}:
99
120
132
140
216
227
⋮
The benchmark on my machine is:
x = rand(1:10,1000,4) # 48 dups
@btime unique2($x, 1);
124.342 μs (31 allocations: 145.97 KiB)
@btime duplicated($x);
407.809 μs (9325 allocations: 394.78 KiB)
x = rand(1:4,1000,4) # 751 dups
@btime unique2($x, 1);
66.062 μs (25 allocations: 50.30 KiB)
@btime duplicated($x);
222.337 μs (4851 allocations: 237.88 KiB)
The result shows that the convoluted-metaprogramming-hashtable way in Base benefits a lot from lower memory allocation.
In newer Julia versions (this answer mentions v1.4 and above), you need to write unique(a, dims=1), where a is your N-by-2 array:
julia> a=[2 2 ; 2 2; 1 2; 3 1]
4×2 Array{Int64,2}:
2 2
2 2
1 2
3 1
julia> unique(a,dims=1)
3×2 Array{Int64,2}:
2 2
1 2
3 1
You can also go with:
duplicated(x) = foldl(
(d,y)->(x[y,:] in d[1] ? (d[1],push!(d[2],y)) : (push!(d[1],x[y,:]),d[2])),
(Set(), Vector{Int}()),
1:size(x,1))[2]
This collects a set of seen rows, and outputs the indices of those already seen. This is essentially the minimal effort needed to get the result, so it should be fast.
julia> x = rand(1:2,5,2)
5×2 Array{Int64,2}:
2 1
1 2
1 2
1 1
1 1
julia> duplicated(x)
2-element Array{Int64,1}:
3
5
julia> x[duplicated(x),:]
2×2 Array{Int64,2}:
1 2
1 1
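The seen-set approach described above is language-independent. As a rough sketch in plain Python (the function name duplicated is borrowed from the answer, and indices are 0-based here, unlike Julia's 1-based indices):

```python
def duplicated(rows):
    """Return the 0-based indices of rows that repeat an earlier row."""
    seen = set()
    dup_idx = []
    for i, row in enumerate(rows):
        key = tuple(row)      # lists are unhashable, so use a tuple as the set key
        if key in seen:
            dup_idx.append(i)
        else:
            seen.add(key)
    return dup_idx

# Same 5x2 example as in the Julia session above
x = [[2, 1], [1, 2], [1, 2], [1, 1], [1, 1]]
dup_idx = duplicated(x)
dup_rows = [x[i] for i in dup_idx]
unique_rows = [row for i, row in enumerate(x) if i not in set(dup_idx)]
```

The indices 2 and 4 correspond to rows 3 and 5 in the Julia output, and dropping them leaves the first occurrence of each row.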

R question about sapply/plyr syntax: how to pass variable values to a function

Is there a way to pass a variable value in ddply/sapply directly to a function without the function (x) notation?
E.g. Instead of:
ddply(bu,.(trial), function (x) print(x$tangle) )
Is there a way to do:
ddply(bu,.(trial), print(tangle) )
I am asking because with many variables this notation becomes very cumbersome.
Thanks!
You can use fn$ in the gsubfn package. Just preface the function in question with fn$ and then you can use a formula notation as shown here:
> library(gsubfn)
>
> # instead of specifying function(x) mean(x) / sd(x)
>
> fn$sapply(iris[-5], ~ mean(x) / sd(x))
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.056602 7.014384 2.128819 1.573438
> library(plyr)
> # instead of specifying function(x) colMeans(x[-5]) / sd(x[-5])
>
> fn$ddply(iris, .(Species), ~ colMeans(x[-5]) / sd(x[-5]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 14.20183 9.043319 8.418556 2.334285
2 versicolor 11.50006 8.827326 9.065547 6.705345
3 virginica 10.36045 9.221802 10.059890 7.376660
Just add your function parameters in the **ply command. For example:
ddply(my_data, c("var1","var2"), my_function, param1=something, param2=something)
where my_function usually looks like
my_function(x, param1, param2)
Here's a working example of this:
require(plyr)
n=1000
my_data = data.frame(
subject=1:n,
city=sample(1:4, n, T),
gender=sample(1:2, n, T),
income=sample(50:200, n, T)
)
my_function = function(data_in, dv, extra=F){
dv = data_in[,dv]
output = data.frame(mean=mean(dv), sd=sd(dv))
if(extra) output = cbind(output, data.frame(n=length(dv), se=sd(dv)/sqrt(length(dv)) ) )
return(output)
}
#with params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=T)
city gender mean sd n se
1 1 1 127.1158 44.64347 95 4.580324
2 1 2 125.0154 44.83492 130 3.932283
3 2 1 130.3178 41.00359 107 3.963967
4 2 2 128.1608 43.33454 143 3.623816
5 3 1 121.1419 45.02290 148 3.700859
6 3 2 120.1220 45.01031 123 4.058443
7 4 1 126.6769 38.33233 130 3.361968
8 4 2 125.6129 44.46168 124 3.992777
#without params
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=F)
city gender mean sd
1 1 1 127.1158 44.64347
2 1 2 125.0154 44.83492
3 2 1 130.3178 41.00359
4 2 2 128.1608 43.33454
5 3 1 121.1419 45.02290
6 3 2 120.1220 45.01031
7 4 1 126.6769 38.33233
8 4 2 125.6129 44.46168
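The pattern of forwarding extra named arguments through the grouping call is not specific to plyr. The following plain-Python sketch shows the same idea under stated assumptions: group_apply and summarize are hypothetical helpers written for this illustration, and the four data rows are made up.

```python
from collections import defaultdict
from statistics import mean, stdev

def group_apply(rows, keys, func, **kwargs):
    """Split rows by the given keys and call func(group, **kwargs) on each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {g: func(members, **kwargs) for g, members in sorted(groups.items())}

def summarize(rows, dv, extra=False):
    """Mean and sample sd of column dv; optionally also n and the standard error."""
    values = [r[dv] for r in rows]
    out = {"mean": mean(values), "sd": stdev(values)}
    if extra:
        out["n"] = len(values)
        out["se"] = out["sd"] / len(values) ** 0.5
    return out

# Made-up illustrative data
data = [
    {"city": 1, "gender": 1, "income": 100},
    {"city": 1, "gender": 1, "income": 120},
    {"city": 1, "gender": 2, "income": 90},
    {"city": 1, "gender": 2, "income": 110},
]
result = group_apply(data, ["city", "gender"], summarize, dv="income", extra=True)
```

Here dv and extra play the same role as in the ddply call: they are not consumed by the grouping step but handed on unchanged to the per-group function.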

Subsetting in a function to calculate a row total

I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various methods, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this approach did succeed in combining two data frames of different lengths, the ‘rowSums’ function gives the wrong values in this example. Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how I can:
1) Subset data from a data frame on the basis of ‘Type’,
2) Insert these individual subsets of different lengths into a new data frame,
3) Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication to this problem is that this needs to be done in a function or in some otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
tmpNumberOfTypes <- c(levels(x$Type))
for( i in tmpNumberOfTypes){
subSetofData <- subset(x, Type = i, select = Value)
if( i == 1) {
totalDf <- subSetOfData }
else{
totalDf <- cbind(totalDf, subSetofData)}
}
return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to Joris's comment (see below) I got a push in the right direction. However, when trying to translate his solution to my data frame, I ran into additional problems. His proposed answer works and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts :
1) you could use na.rm=T in rowSums
2) How do you know which one has to go with which? You might add some indexing.
eg :
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>
# for(i in 1:ncol(tvWide)){
#   for(j in 1:nrow(tvWide)){
#     if (is.na(tvWide[j,i])){
#       tvWide[j,i] = 0
#     }
#   }
# }
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))
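To make the logic of the answers explicit (number the values within each type, line them up side by side, then sum each row while ignoring missing entries), here is a rough plain-Python sketch. The function name row_totals is an assumption for illustration; None plays the role of NA, and the data is the Type/Value table from the question.

```python
from collections import defaultdict
from itertools import zip_longest

def row_totals(records):
    """Spread (type, value) pairs into one column per type and sum across each row."""
    columns = defaultdict(list)
    for type_, value in records:
        columns[type_].append(value)
    names = sorted(columns)
    # zip_longest pads the shorter columns with None, like NA in the wide data frame
    rows = list(zip_longest(*(columns[n] for n in names)))
    # Skip the padding when summing, like rowSums(..., na.rm = TRUE)
    totals = [sum(v for v in row if v is not None) for row in rows]
    return names, rows, totals

records = [("A", 10), ("A", 15), ("A", 20), ("A", 25),
           ("B", 30), ("B", 40), ("B", 50), ("B", 60),
           ("B", 70), ("B", 80), ("B", 90)]
names, rows, totals = row_totals(records)
```

Because the columns are built from whatever types appear in the input, the same function handles any number of types without manual subsetting.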