implementation of the Gower distance function - function

I have a matrix (size: 28 columns and 47 rows) with numbers. This matrix has an extra row that is contains headers for the columns ("ordinal" and "nominal").
I want to use the Gower distance function on this matrix. Here says that:
The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:
d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )
In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
factor or character columns are
considered as categorical nominal
variables and d_ijk = 0 if
x_ik =x_jk, 1 otherwise;
ordered columns are considered as
categorical ordinal variables and
the values are substituted with the
corresponding position index, r_ik in
the factor levels. These position
indexes (that are different from the
output of the R function rank) are
transformed in the following manner
z_ik = (r_ik - 1)/(max(r_ik) - 1)
These new values, z_ik, are treated as observations of an
interval scaled variable.
As far as the weight delta_ijk is concerned:
delta_ijk = 0 if x_ik = NA or x_jk =
NA;
delta_ijk = 1 in all the other cases.
I know that there is a gower.dist function, but I must do it that way.
So, for "d_ijk", "delta_ijk" and "z_ik", I tried to make functions, as I didn't find a better way.
I started with "delta_ijk" and i tried this:
Delta=function(i,j){for (i in 1:28){for (j in 1:47){
+{if (MyHeader[i,j]=="nominal")
+ result=0
+{else if (MyHeader[i,j]=="ordinal") result=1}}}}
+;result}
But I got error. So I got stuck and I can't do the rest.
P.S. Excuse me if I make mistakes, but English is not a language I very often.

Why do you want to reinvent the wheel billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster which comes with R.
First things first though, get those "data type" headers out of your data. If this truly is a matrix then character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
C = c("nominal",1,2,1))
Which gives this --- note that all are stored as factors because of the extra info.
> head(df)
A B C
1 ordinal nominal nominal
2 1 A 1
3 2 B 2
4 3 A 1
> str(df)
'data.frame': 4 obs. of 3 variables:
$ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
$ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
$ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1
If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.
> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame': 3 obs. of 3 variables:
$ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
$ B: Factor w/ 2 levels "A","B": 1 2 1
$ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
2 3
3 0.8333333
4 0.3333333 0.8333333
Metric : mixed ; Types = O, N, N
Number of objects : 3
Which gives the same as gower.dist() for this data (although in a slightly different format (as.matrix(daisy(DF))) would be equivalent):
> gower.dist(DF)
[,1] [,2] [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000
You say you can't do it this way? Can you explain why not? As you seem to be going to some degree of effort to do something that other people have coded up for you already. This isn't homework, is it?

I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:
Delta=function(i,j){for (i in 1:28) {for (j in 1:47){
if (MyHeader[i,j]=="nominal") {
result=0
# the "{" in the next line before else was sabotaging your efforts
} else if (MyHeader[i,j]=="ordinal") { result=1} }
result}
}

Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.
P.S. The solution that you suggested at my other topic for changing the class of the columns:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
didn't work. It changed only the first nominal and ordinal column.
Thanks again for your help.

Related

How to calculate Moran's I and GWR when given duplicated factor data

I'm trying to do the statistical analysis by using the Moran's I, but it leads me into a serious problem.
Suppose the data look like:
y
indep1
indep2
coord_x
coord_y
District
y1
indep1 1
indep2 1
coord_x 1
coord_y 1
A
y2
indep1 2
indep2 2
coord_x 1
coord_y 1
A
y3
indep1 3
indep2 3
coord_x 1
coord_y 1
A
y4
indep1 4
indep2 4
coord_x 2
coord_y 2
B
y5
indep1 5
indep2 5
coord_x 2
coord_y 2
B
Note that the data is given as form of *.shp.
This dataframe has 2 unique districts, but total 5 rows of data. If we want to calculate the Moran's I, then the weighted matrix W is 2 by 2.
But If I run the code
ols <- lm(y~indep1+indep2, data=dataset)
We cannot calculate the
lm.morantest(ols, w) #w:weighted matrix
# returns "different length"
How can we solve that problem? IF the total number of data and the number of unique districts are different, then how can we check the spatial auto-correlation between the districts and how can we apply GWR(geographically weighted regression?) Any reference papers or your advises could be helpful.
Thank you for your help in advance.
I tried to calculate the moran's I by making the number our dependent variables same as the number of unique districts. It can be possible when aggregating the sum of dependent variables with respect to the districts.
library(reshape2)
y_agg <- dcast(shp#data, district~., value.var="dependent variable", sum)
y_agg <- y_agg$.
moran.test(y_agg, W)
But I think it is not the right way to analyze the spatial regression since all the other independent variables are ignored. How can I solve that problem? Is there any way for solving that problem without making the aggregated independent variables of my data?
Thank you.

how to print remainders from last to first?

(this is a theoretical question, and therefore it's not related to a particular programming language).
I'm trying to figure out a way to design an algorithm to print a number in binary (i.e. from decimal to binary), the request is easy and it consists of taking the remainders in the opposite order (i.e. remainders of successive divisions).
the first part of the question is easy, for example:
4 / 2 = 2, and 4 % 2 = 0,
2 / 2 = 1, and 2 % 2 = 0,
1 / 2 = 0, and 1 % 2 = 1.
problems comes in the second part: take each digit from the last to the first, I have no clue.

pattern of recursively splitting an array to odd and even elements

I am looking to find a pattern to recursively split an array to odd and even elements. I will try to describe the problem in the following:
suppose we have an array of length 16 as:
a=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
First iteration: splitting in odd and even
[0,2,4,6,8,10,12,14]
[1,3,5,7,9,11,13,15]
which basically are
a[2i] for i=0:7
a[2i+1] for i=0:7
splitting each of these arrays into odd and even elements again we have
[0,4,8,12]
[2,6,10,14]
[1,5,9,13]
[3,7,11,15]
that similarly are
4i for i=0:3
4i+2
4i+1
4i+3
splitting again the array elements would be
[0,8]
[4,12]
.
.
[1,9]
or
8i for i=0:1
8i+4
8i+2
8i+6
8+1
8i+5
8i+3
8i+1
Arrays needed to split recursively until each array has only two elements.
One things that I noticed that the bottom half is similar to the top one and we just need to add "1" to the index terms
I was wondering how Can I find the pattern for an array with an "n" elements?
Thank you very much for your time.
assuming your number n is a power of 2 (aka 2^k):
then you will have m = n/2 = 2^(k-1) arrays with following numbers for i in {0,1}:
0: m*i+f(0)
1: m*i+f(1)
...
j: m*i+f(j)
...
m-1: m*i+f(m-1)
where f(x) is a function which takes an integer (x), transforms it into an k-1-bit binary number (b), reverses it (rb) and returns its decimal value (y).
Example for k=4 (which looks a lot like your values):
x
b
rb
f(x)=y
0
000
000
0
1
001
100
4
2
010
010
2
3
011
110
6
4
100
001
1
5
101
101
5
6
110
011
3
7
111
111
7

KDB: apply dyadic function across two lists

Consider a function F[x;y] that generates a table. I also have two lists; xList:[x1;x2;x3] and yList:[y1;y2;y3]. What is the best way to do a simple comma join of F[x1;y1],F[x1;y2],F[x1;y3],F[x2;y1],..., thereby producing one large table?
You have asked for the cross product of your argument lists, so the correct answer is
raze F ./: xList cross yList
Depending on what you are doing, you might want to look into having your function operate on the entire list of x and the entire list of y and return a table, rather than on each pair and then return a list of tables which has to get razed. The performance impact can be considerable, for example see below
q)g:{x?y} //your core operation
q)//this takes each pair of x,y, performs an operation and returns a table for each
q)//which must then be flattened with raze
q)fm:{flip `x`y`res!(x;y; enlist g[x;y])}
q)//this takes all x, y at once and returns one table
q)f:{flip `x`y`res!(x;y;g'[x;y])}
q)//let's set a seed to compare answers
q)\S 1
q)\ts do[10000;rm:raze fm'[x;y]]
76 2400j
q)\S 1
q)\ts do[10000;r:f[x;y]]
22 2176j
q)rm~r
1b
Setup our example
q)f:{([] total:enlist x+y; x:enlist x; y:enlist y)}
q)x:1 2 3
q)y:4 5 6
Demonstrate F[x1;y1]
q)f[1;4]
total x y
---------
5 1 4
q)f[2;5]
total x y
---------
7 2 5
Use the multi-valent apply operator together with each' to apply to each pair of arguments.
q)raze .'[f;flip (x;y)]
total x y
---------
5 1 4
7 2 5
9 3 6
Another way to achieve it using each-both :
x: 1 2 3
y: 4 5 6
f:{x+y}
f2:{ a:flip x cross y ; f'[a 0;a 1] }
f2[x;y]
5j, 6j, 7j, 6j, 7j, 8j, 7j, 8j, 9j

Define quadrant based on positive/negative values in two columns

I have a data set with two columns of positive and negative numbers. I would like to create a third column that reflects which quadrant they would appear in if plotted in Cartesian space.
For example, if Column A is positive, and Column B is positive, then Column C would record "I." If column A is negative, and Column B is negative, then Column C would record "III," and so on.
I suspect I can do this with an if else function and then loop or apply it across rows in the data set, but my attempts to write the if else have so far failed.
Well, the following would give you values between 1 and 4:
C <- (A<0) + (B<0)*2L + 1L
This transforms the whole column in one go. The magic lies in that FALSE/TRUE is treated as 0/1, and you can do math on it. By using 2L and 1L instead of 2 and 1, you keep the result as integers instead of forcing a coercion to doubles (which is slower and takes more memory).
Then assuming you want to map to these quadrants:
+B
|
II | I
-A -----+---- +A
III | IV
|
-B
You could use this (updated to use a data.frame):
# Sample data.frame with columns a & b
d <- data.frame(a=c(1,-1,-1,1), b=c(1,1,-1,-1))
quadrantNames <- c('I', 'II', 'IV', 'III') # Your labels...
d <- within(d, c <- quadrantNames[(a<0) + (b<0)*2L + 1L])
head(d) # print data
a b c
1 1 1 I
2 -1 1 II
3 -1 -1 III
4 1 -1 IV
...and if you want the quadrants mapped differently, just change the order of the labels in quadrantNames.