Julia: Detect and remove duplicate rows from array? - duplicates

What is the best method of detecting and dropping duplicate rows from an array in Julia?
x = Integer.(round.(10 .* rand(1000,4)))
# In R I would apply the duplicated function to drop the repeated rows:
# x <- x[!duplicated(x), ]

unique is what you are looking for (note that it only removes the duplicates; it does not answer the detection part of the question):
julia> x = Integer.(round.(10 .* rand(1000,4)))
1000×4 Array{Int64,2}:
7 3 10 1
7 4 8 9
7 7 3 0
3 4 8 2
⋮
julia> unique(x, 1)
973×4 Array{Int64,2}:
7 3 10 1
7 4 8 9
7 7 3 0
3 4 8 2
⋮
As for the detection part, a dirty fix would be to edit this line in Base's implementation of unique:
@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))
to:
(@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))), uniquerows
Alternatively, you could define your own unique2 with the abovementioned changes:
using Base.Cartesian
import Base.Prehashed
@generated function unique2(A::AbstractArray{T,N}, dim::Int) where {T,N}
......
end
julia> y, idx = unique2(x, 1)
julia> y
960×4 Array{Int64,2}:
8 3 1 5
8 3 1 6
1 1 0 1
8 10 1 10
9 1 8 7
⋮
julia> setdiff(1:1000, idx)
40-element Array{Int64,1}:
99
120
132
140
216
227
⋮
The benchmark on my machine is:
using BenchmarkTools
x = rand(1:10,1000,4) # 48 dups
@btime unique2($x, 1);
124.342 μs (31 allocations: 145.97 KiB)
@btime duplicated($x);
407.809 μs (9325 allocations: 394.78 KiB)
x = rand(1:4,1000,4) # 751 dups
@btime unique2($x, 1);
66.062 μs (25 allocations: 50.30 KiB)
@btime duplicated($x);
222.337 μs (4851 allocations: 237.88 KiB)
The results suggest that the convoluted metaprogramming-plus-hashtable approach in Base benefits a lot from its lower memory allocation.

In Julia v1.4 and above, you need to call unique(a, dims=1), where a is your N-by-2 array:
julia> a=[2 2 ; 2 2; 1 2; 3 1]
4×2 Array{Int64,2}:
2 2
2 2
1 2
3 1
julia> unique(a,dims=1)
3×2 Array{Int64,2}:
2 2
1 2
3 1

You can also go with:
duplicated(x) = foldl(
(d,y)->(x[y,:] in d[1] ? (d[1],push!(d[2],y)) : (push!(d[1],x[y,:]),d[2])),
(Set(), Vector{Int}()),
1:size(x,1))[2]
This collects a set of the rows seen so far and outputs the indices of the rows that repeat an already seen row. This is essentially the minimal effort needed to get the result, so it should be fast.
julia> x = rand(1:2,5,2)
5×2 Array{Int64,2}:
2 1
1 2
1 2
1 1
1 1
julia> duplicated(x)
2-element Array{Int64,1}:
3
5
julia> x[duplicated(x),:]
2×2 Array{Int64,2}:
1 2
1 1
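The seen-set idea is not specific to Julia. As a rough sketch in Python (this is not the answer's code; the name duplicated_rows and the list-of-lists input are assumptions made for illustration), each row is turned into a tuple so it can be stored in a set, and 0-based indices are returned:

```python
def duplicated_rows(rows):
    """Return the 0-based indices of rows that repeat an earlier row."""
    seen = set()      # rows encountered so far, stored as hashable tuples
    dup_idx = []      # indices of rows that were already in `seen`
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            dup_idx.append(i)
        else:
            seen.add(key)
    return dup_idx


# Same 5x2 data as in the Julia session above.
x = [[2, 1], [1, 2], [1, 2], [1, 1], [1, 1]]
dups = duplicated_rows(x)

# Dropping the detected rows keeps the first occurrence of each row.
dup_set = set(dups)
unique_rows = [row for i, row in enumerate(x) if i not in dup_set]
```

For this input dups holds the indices 2 and 4, which correspond to the 1-based 3 and 5 that Julia reports.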

Related

Applying 'vector' of functions on a Matlab matrix

Let's say I've got a matrix with n columns, and I've got n different functions.
Is it possible to apply the i-th function to each element of the i-th column efficiently, that is, without using a loop?
For example for the following variables:
funs = @(x) [x, cos(x), x.^2]
A = [1 0 1
2 0 2
3 0 3
4 0 4] ;
I would like to obtain the following result:
B = [1 1 1
2 1 4
3 1 9
4 1 16] ;
without looping through columns...

Why does an out of bound error happen in a for loop?

An out of bound error occurred.
This is the Octave language.
for ii=1:1:10
m(ii)=ii*8
q=m(ii)
if (ii>=2)
q(ii).xdot=(q(ii).x-q(ii-1).x)/Ts;
end
end
But the error says
q(2): out of bound 1
How can I fix it?
For this type of assignment you do not need a loop, and in any case you need to define Ts.
To calculate the successive differences you can use diff:
x=(1:1:10)*8
x =
8 16 24 32 40 48 56 64 72 80
octave:5> Ts=2
Ts = 2
octave:6> xdot=diff(x)/Ts
xdot =
4 4 4 4 4 4 4 4 4
octave:7> size(x)
ans =
1 10
octave:8> size(xdot)
ans =
1 9
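To make it explicit why xdot ends up one element shorter than x, here is a small Python sketch of what a first difference computes. The helper name diff is chosen here only to mirror the Octave function; this is an illustration, not Octave's implementation:

```python
def diff(values):
    """First differences: values[i + 1] - values[i] for each adjacent pair."""
    return [b - a for a, b in zip(values, values[1:])]


Ts = 2                              # sample period, as in the Octave session
x = [i * 8 for i in range(1, 11)]   # 8, 16, ..., 80
xdot = [d / Ts for d in diff(x)]    # one entry per adjacent pair of x
```

Ten samples have nine adjacent pairs, so xdot has nine entries, each equal to 4.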

Vectorization of feature scaling

I want to feature scale a matrix (X) with 2 columns. I am using mean normalization, and I wrote the following lines in Octave:
X_norm = X
mu = mean(X);
sigma = std(X);
X_norm(:,1) = (X_norm(:,1) .- mu(:,1)) ./ sigma(:,1);
X_norm(:,2) = (X_norm(:,2) .- mu(:,2)) ./ sigma(:,2);
Can you please let me know a cleaner way to vectorize these calculations?
I checked my code by comparing with the result from zscore(X) and they matched - i.e. a sum(X_norm - zscore(X)) returned me 0 0.
I am constrained to not use zscore(), and hence the question.
Sample data as follows:
2104 3
1600 3
2400 3
1416 2
3000 4
1985 4
1534 3
1427 3
1380 3
1494 3
1940 4
2000 3
1890 3
4478 5
1268 3
2300 4
1320 2
1236 3
2609 4
3031 4
1767 3
1888 2
1604 3
1962 4
3890 3
1100 3
1458 3
2526 3
2200 3
2637 3
You could simply do:
X_norm = (X .- mean(X,1)) ./ std(X,0,1);
During cross-validation I ran into a division-by-zero issue (a feature with zero standard deviation).
This worked for me:
mu = mean(X);
X_norm = X - mu;
sigma = std(X);
% Skip zero div
sigmaZeroIdx = sigma == 0;
sigma(1,sigmaZeroIdx) = 1;
X_norm = X_norm ./ sigma;
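The same guard can be sketched outside Octave. The Python version below is only an illustration under a few assumptions: the function name standardize_columns is made up, the data is a plain list of rows, and the third, constant column is invented to trigger the zero-deviation case. statistics.stdev uses the n-1 denominator, like Octave's default std:

```python
from statistics import mean, stdev


def standardize_columns(X):
    """Subtract each column's mean and divide by its sample standard deviation.

    Columns with zero deviation are divided by 1 instead, so they become
    all zeros rather than causing a division by zero.
    """
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    sigma = [stdev(c) for c in cols]
    sigma = [s if s != 0 else 1.0 for s in sigma]   # skip zero division
    return [[(v - m) / s for v, m, s in zip(row, mu, sigma)] for row in X]


# First two columns are taken from the sample data; the constant third
# column is hypothetical and only there to exercise the guard.
X = [[2104, 3, 7], [1600, 3, 7], [2400, 3, 7], [1416, 2, 7]]
X_norm = standardize_columns(X)
```

After the call every column of X_norm has mean zero, the first two have unit sample deviation, and the constant column is all zeros.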
I think you could apply a for loop over the N features:
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
% loop over the feature columns
for iter = 1:size(X, 2)
mu(1,iter) = mean(X_norm(:,iter));
X_norm(:,iter) = X_norm(:,iter) .- mu(1,iter);
sigma(1,iter) = std(X_norm(:,iter));
% scale the centered column by its standard deviation
X_norm(:,iter) = X_norm(:,iter) ./ sigma(1,iter);
end

How does mapcat work?

I am really new to Clojure! How does mapcat work?
The mapcat function is just a shortcut for applying the concat function to the result of the map function:
=> (mapcat reverse [[3 2 1 0] [6 5 4] [9 8 7]])
(0 1 2 3 4 5 6 7 8 9)
=> (apply concat (map reverse [[3 2 1 0] [6 5 4] [9 8 7]]))
(0 1 2 3 4 5 6 7 8 9)
References:
official Clojure API docs
clojuredocs.org website
By using mapcat in combination with vector function you can interlace several collections:
=> (mapcat vector [1 2 3 4 5 6] [:q :w :e :r :t :y])
(1 :q 2 :w 3 :e 4 :r 5 :t 6 :y)
You'll get the same result using list function instead of vector.
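If it helps to see the same "map, then concatenate" idea in another language, here is a rough Python analogue. The function below is a hypothetical helper written for this illustration, not part of any library, and it builds a list eagerly for simplicity:

```python
from itertools import chain


def mapcat(f, *colls):
    """Map f over the collections in parallel, then concatenate the results."""
    return list(chain.from_iterable(map(f, *colls)))


# Counterpart of (mapcat reverse [[3 2 1 0] [6 5 4] [9 8 7]])
reversed_all = mapcat(lambda v: v[::-1], [[3, 2, 1, 0], [6, 5, 4], [9, 8, 7]])

# Counterpart of (mapcat vector ...): pairing elements interleaves the inputs.
interleaved = mapcat(lambda a, b: [a, b], [1, 2, 3], ["q", "w", "e"])
```

Here reversed_all is the numbers 0 through 9 in order, and interleaved is [1, "q", 2, "w", 3, "e"].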

R question about sapply /plyr syntax: how to pass variable values to a function

Is there a way to pass a variable value in ddply/sapply directly to a function without the function (x) notation?
E.g. Instead of:
ddply(bu,.(trial), function (x) print(x$tangle) )
Is there a way to do:
ddply(bu,.(trial), print(tangle) )
I am asking because with many variables this notation becomes very cumbersome.
Thanks!
You can use fn$ in the gsubfn package. Just preface the function in question with fn$ and then you can use a formula notation as shown here:
> library(gsubfn)
>
> # instead of specifying function(x) mean(x) / sd(x)
>
> fn$sapply(iris[-5], ~ mean(x) / sd(x))
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.056602 7.014384 2.128819 1.573438
> library(plyr)
> # instead of specifying function(x) colMeans(x[-5]) / sd(x[-5])
>
> fn$ddply(iris, .(Species), ~ colMeans(x[-5]) / sd(x[-5]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 14.20183 9.043319 8.418556 2.334285
2 versicolor 11.50006 8.827326 9.065547 6.705345
3 virginica 10.36045 9.221802 10.059890 7.376660
Just add your function parameters in the **ply command. For example:
ddply(my_data, c("var1","var2"), my_function, param1=something, param2=something)
where my_function usually looks like
my_function(x, param1, param2)
Here's a working example of this:
require(plyr)
n=1000
my_data = data.frame(
subject=1:n,
city=sample(1:4, n, T),
gender=sample(1:2, n, T),
income=sample(50:200, n, T)
)
my_function = function(data_in, dv, extra=F){
dv = data_in[,dv]
output = data.frame(mean=mean(dv), sd=sd(dv))
if(extra) output = cbind(output, data.frame(n=length(dv), se=sd(dv)/sqrt(length(dv)) ) )
return(output)
}
# with the extra columns (extra=T)
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=T)
city gender mean sd n se
1 1 1 127.1158 44.64347 95 4.580324
2 1 2 125.0154 44.83492 130 3.932283
3 2 1 130.3178 41.00359 107 3.963967
4 2 2 128.1608 43.33454 143 3.623816
5 3 1 121.1419 45.02290 148 3.700859
6 3 2 120.1220 45.01031 123 4.058443
7 4 1 126.6769 38.33233 130 3.361968
8 4 2 125.6129 44.46168 124 3.992777
# without the extra columns (extra=F)
ddply(my_data, c("city", "gender"), my_function, dv="income", extra=F)
city gender mean sd
1 1 1 127.1158 44.64347
2 1 2 125.0154 44.83492
3 2 1 130.3178 41.00359
4 2 2 128.1608 43.33454
5 3 1 121.1419 45.02290
6 3 2 120.1220 45.01031
7 4 1 126.6769 38.33233
8 4 2 125.6129 44.46168