How to calculate Moran's I and GWR when given duplicated factor data

I'm trying to do a statistical analysis using Moran's I, but I've run into a serious problem.
Suppose the data look like:
y    indep1      indep2      coord_x      coord_y      District
y1   indep1 1    indep2 1    coord_x 1    coord_y 1    A
y2   indep1 2    indep2 2    coord_x 1    coord_y 1    A
y3   indep1 3    indep2 3    coord_x 1    coord_y 1    A
y4   indep1 4    indep2 4    coord_x 2    coord_y 2    B
y5   indep1 5    indep2 5    coord_x 2    coord_y 2    B
Note that the data are given as a shapefile (*.shp).
This data frame has 2 unique districts but 5 rows of data in total. If we want to calculate Moran's I between districts, the spatial weights matrix W is 2 by 2.
But if I run the code
ols <- lm(y ~ indep1 + indep2, data = dataset)
then the Moran test cannot be calculated:
lm.morantest(ols, w)  # w: the spatial weights matrix (a listw object)
# returns "different length" -- 5 residuals vs. 2 districts
How can we solve that problem? If the total number of observations and the number of unique districts differ, how can we check the spatial autocorrelation between the districts, and how can we apply GWR (geographically weighted regression)? Any reference papers or advice would be helpful.
Thank you in advance for your help.
I tried to calculate Moran's I by making the number of dependent-variable values the same as the number of unique districts. This is possible by aggregating the sum of the dependent variable with respect to the districts.
library(reshape2)
# Sum the dependent variable by district (note shp@data, not shp#data;
# "y" and "District" are the column names from the example above)
y_agg <- dcast(shp@data, District ~ ., fun.aggregate = sum, value.var = "y")
y_agg <- y_agg$.
moran.test(y_agg, W)  # W: district-level spatial weights (a listw object)
But I don't think this is the right way to analyze the spatial regression, since all the other independent variables are ignored. How can I solve that problem? Is there any way to solve it without aggregating the independent variables of my data?
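For reference, a minimal sketch of the fully aggregated approach (assuming the spdep package, a district-level listw object w, and that summing is an appropriate aggregation; all of these are assumptions rather than the only choice):
library(spdep)

# Aggregate every variable to the district level (sum is one choice;
# a mean may be more appropriate depending on what the variables measure).
agg <- aggregate(cbind(y, indep1, indep2) ~ District, data = shp@data, FUN = sum)

# District-level model: the residual vector now has one entry per district,
# matching the 2-by-2 weights matrix.
ols_agg <- lm(y ~ indep1 + indep2, data = agg)
lm.morantest(ols_agg, w)

# For GWR (e.g. the spgwr package), the same aggregation would be paired
# with one coordinate pair per district, such as the district centroids.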
Thank you.

Related

Multinomial Logistic Regression Predictors Set Up

I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race, using each horse's previous average speed.
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 42.034929 39.004813 43.830139 5
2 39.482081 42.199627 41.034929 41.004813 40.830139 4
I am stuck on how to handle the independent variables for each horse, given that any of the 5 horses' average speeds can be placed in any of H1_SPEED through H5_SPEED.
Since for each race I can put any of the 5 horses under H1_SPEED, there is no real relationship between H1_SPEED from RACE_ID 1 and H1_SPEED from RACE_ID 2 other than the arbitrary position I selected.
Would there be any difference if the dataset looked like this -
For RACE_ID 1 I swapped H3_SPEED and H5_SPEED and changed WINNING_HORSE from 5 to 3
For RACE_ID 2 I swapped H4_SPEED and H1_SPEED and changed WINNING_HORSE from 4 to 1
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 43.830139 39.004813 42.034929 3
2 41.004813 42.199627 41.034929 39.482081 40.830139 1
Is this an issue, and if so, how should it be handled? What if I wanted to add more independent features per horse?
You cannot change your dataset in that way, because each feature (column) has a meaning, and it probably depends on the values of the other features. You can picture each row as a point in a six-dimensional space: if you change the value of a feature, the position of the point changes; it does not remain stationary.
If you deem that a feature is useless for solving your problem (i.e. it is independent of the target), you can drop it or avoid using it during the training phase of your model.
Edit
To solve your specific problem, you may add a column for each speed column that records which specific horse is running at that speed. It is a sort of data augmentation, adding more problem-related features to your model.
RACE_ID H1_SPEED H1_HORSE H2_SPEED H2_HORSE ... WINNING_HORSE
1 40.482081 1 44.199627 2 ... 5
2 39.482081 3 42.199627 5 ... 4
I've invented the numbers associated with each horse, but it seems that this information is present in your dataset.
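As an illustration of fitting such a model, here is a minimal sketch with simulated data and nnet::multinom (the data, the winner rule, and all names are invented for the example; the H*_HORSE columns proposed above would enter as additional factor predictors):
library(nnet)
set.seed(1)

# Simulated races: five speed columns; the winner is defined as the fastest
# position purely so the example has a learnable signal.
races <- as.data.frame(matrix(rnorm(100 * 5, mean = 42, sd = 2), ncol = 5))
names(races) <- paste0("H", 1:5, "_SPEED")
races$WINNING_HORSE <- factor(max.col(races[, 1:5]))

fit <- multinom(WINNING_HORSE ~ ., data = races)
probs <- predict(fit, type = "probs")  # one win-probability column per position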

How to perform a many-to-many or (at least) a outer-join in SPSS

Usually I use R for my data analysis, but these days I have to use SPSS. I was expecting that data manipulation might get a little more difficult this way, but after my first day I'm ready to surrender :D and I would really appreciate some help...
My problem is the following:
I have two datasets, which both have an ID number. Neither dataset has unique IDs (in the one dataset that should have unique IDs, there is a kind of duplicated row).
In a perfect world I would like to keep this duplicated row and simply perform a many-to-many join. But I have accepted that I might have to delete this "bad" row (in dataset A) and perform a 1:many join (join dataset B to dataset A, which contains the unique IDs).
If I run the join (and accept that it seems to be possible to run only a many:1 join, not a 1:many join), I have the problem that I lose IDs. If I join dataset A to dataset B, I lose all cases that are not part of dataset B. But I really would like to keep both sets of IDs, like in a full join or something.
Do you know if there is (kind of) a simple solution to my problem?
Example:
dataset A:
ID  VAL1
1   A
1   B
2   D
3   K
4   A
dataset B:
ID  VAL2
1   g
2   k
4   a
5   c
5   d
5   a
2   x
expected result (best solution):
ID  VAL1  VAL2
1   A     g
1   B     g
2   D     k
3   K     NA
4   A     a
2   D     x
expected result (second best solution):
ID  VAL1  VAL2
1   A     g
2   D     k
3   K     NA
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
what I get (worst solution):
ID  VAL1  VAL2
1   A     g
2   D     k
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
From your example it looks like what you need is a full many-to-many join, based on the IDs existing in dataset A. You can get this by creating a full Cartesian product of the two datasets, using dataset A as the first (left) dataset.
The following syntax assumes you have the STATS CARTPROD extension command installed. If you don't, see here for how to install it.
First I'll recreate your example to demonstrate on:
dataset close all.
data list list/id1 vl1 (2F3) .
begin data
1 232
1 433
2 456
3 246
4 468
end data.
dataset name aaa.
data list list/id2 vl2 (2F3) .
begin data
1 111
2 222
4 333
5 444
5 555
5 666
2 777
3 888
end data.
dataset name bbb.
Now the actual work is fairly simple:
DATASET ACTIVATE aaa.
STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
/SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
* The new dataset now contains all possible combinations of rows in the two datasets.
* we will select only the relevant combinations, where the two ID's match.
select if id1=id2.
exe.
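Since you normally work in R, for comparison: base R's merge() produces exactly this kind of full many-to-many join. A minimal sketch on the example data from the question:
A <- data.frame(ID = c(1, 1, 2, 3, 4), VAL1 = c("A", "B", "D", "K", "A"))
B <- data.frame(ID = c(1, 2, 4, 5, 5, 5, 2), VAL2 = c("g", "k", "a", "c", "d", "a", "x"))
# all = TRUE keeps unmatched IDs from both sides (filled with NA),
# and duplicated IDs multiply as in a per-ID Cartesian product.
merge(A, B, by = "ID", all = TRUE)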

How to Merge Column of Clusters and Columns with X-Y coord?

I have a column with the name of the points, a column with the X coordinates and a column with Y coordinates.
This is the tab on which I'm working:
I want to create a tab in which I have three columns: one with the clusters' IDs, one with the X coordinates and another with the Y coordinates. And for each cluster I want its X-Y coordinates.
I've tried the following code:
Xcoord <- sort(unique(tabprof$X_coord))
clusters <- sort(unique(tabprof$Cluster_ID))
I tried this in order to merge the two vectors, but it wasn't possible, because they have different numbers of rows. That's probably due to the presence of clusters with the same X coordinate value.
Following our talk in the comments, I'll provide a new solution. I'll use fake data.
A <- c(1,1,1,1,2,2,2,2)
B <- c(3,3,4,4,3,3,3,5)
df <- data.frame(A,B)
res <- unique(df)
> df
A B
1 1 3
2 1 3
3 1 4
4 1 4
5 2 3
6 2 3
7 2 3
8 2 5
> res
A B
1 1 3
3 1 4
5 2 3
8 2 5
So as you see, if column A is the cluster ID and column B the X coordinates, the cluster IDs are duplicated, but each one keeps only its unique coordinate combinations. What is more, if two different IDs have the same coordinates, that's no problem.
I hope it'll be helpful.
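Applied directly to the data in the question (column names taken from the post, tabprof being the asker's data frame), that would be something like:
# Keep one row per unique (cluster ID, X, Y) combination
res <- unique(tabprof[, c("Cluster_ID", "X_coord", "Y_coord")])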

KDB: apply dyadic function across two lists

Consider a function F[x;y] that generates a table. I also have two lists, xList:(x1;x2;x3) and yList:(y1;y2;y3). What is the best way to do a simple comma join of F[x1;y1],F[x1;y2],F[x1;y3],F[x2;y1],..., thereby producing one large table?
You have asked for the cross product of your argument lists, so the correct answer is
raze F ./: xList cross yList
Depending on what you are doing, you might want to look into having your function operate on the entire list of x and the entire list of y and return a table, rather than on each pair and then return a list of tables which has to be razed. The performance impact can be considerable; for example, see below:
q)g:{x?y} //your core operation
q)//this takes each pair of x,y, performs an operation and returns a table for each
q)//which must then be flattened with raze
q)fm:{flip `x`y`res!(x;y; enlist g[x;y])}
q)//this takes all x, y at once and returns one table
q)f:{flip `x`y`res!(x;y;g'[x;y])}
q)//let's set a seed to compare answers
q)\S 1
q)\ts do[10000;rm:raze fm'[x;y]]
76 2400j
q)\S 1
q)\ts do[10000;r:f[x;y]]
22 2176j
q)rm~r
1b
Setup our example
q)f:{([] total:enlist x+y; x:enlist x; y:enlist y)}
q)x:1 2 3
q)y:4 5 6
Demonstrate F[x1;y1]
q)f[1;4]
total x y
---------
5 1 4
q)f[2;5]
total x y
---------
7 2 5
Use the multi-valent apply operator together with each' to apply to each pair of arguments.
q)raze .'[f;flip (x;y)]
total x y
---------
5 1 4
7 2 5
9 3 6
Another way to achieve it, using each-both:
x: 1 2 3
y: 4 5 6
f:{x+y}
f2:{ a:flip x cross y ; f'[a 0;a 1] }
f2[x;y]
5j, 6j, 7j, 6j, 7j, 8j, 7j, 8j, 9j

implementation of the Gower distance function

I have a matrix (size: 28 columns and 47 rows) of numbers. This matrix has an extra row that contains headers for the columns ("ordinal" and "nominal").
I want to use the Gower distance function on this matrix. Here it says that:
The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:
d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )
In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
factor or character columns are considered as categorical nominal variables, and d_ijk = 0 if x_ik = x_jk, 1 otherwise;
ordered columns are considered as categorical ordinal variables, and the values are substituted with the corresponding position index, r_ik, in the factor levels. These position indexes (which are different from the output of the R function rank) are transformed as follows:
z_ik = (r_ik - 1) / (max(r_ik) - 1)
These new values, z_ik, are treated as observations of an interval-scaled variable.
As far as the weight delta_ijk is concerned:
delta_ijk = 0 if x_ik = NA or x_jk = NA;
delta_ijk = 1 in all the other cases.
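For concreteness, the definition above translates into R roughly as follows; this is a minimal sketch assuming the columns are already coded as factor (nominal) or ordered (ordinal):
# Gower dissimilarity between rows i and j of a data frame df
gower_pair <- function(df, i, j) {
  d <- delta <- numeric(ncol(df))
  for (k in seq_along(df)) {
    xi <- df[i, k]; xj <- df[j, k]
    if (is.na(xi) || is.na(xj)) next            # delta_ijk = 0: skip variable
    delta[k] <- 1
    if (is.ordered(df[[k]])) {
      r <- as.numeric(df[[k]])                  # position index r_ik
      z <- (r - 1) / (max(r, na.rm = TRUE) - 1) # z_ik in [0, 1]
      d[k] <- abs(z[i] - z[j])                  # interval-scaled distance
    } else {
      d[k] <- as.numeric(xi != xj)              # nominal: 0 if equal, 1 otherwise
    }
  }
  sum(delta * d) / sum(delta)                   # d(i,j)
}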
I know that there is a gower.dist function, but I must do it this way.
So, for "d_ijk", "delta_ijk" and "z_ik", I tried to write functions, as I didn't find a better way.
I started with "delta_ijk" and tried this:
Delta=function(i,j){for (i in 1:28){for (j in 1:47){
{if (MyHeader[i,j]=="nominal")
result=0
{else if (MyHeader[i,j]=="ordinal") result=1}}}}
;result}
But I got an error, so I got stuck and can't do the rest.
P.S. Excuse me if I make mistakes, but English is not a language I use very often.
Why do you want to reinvent the wheel, billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster, which comes with R.
First things first though: get those "data type" headers out of your data. If this truly is a matrix, then the character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
C = c("nominal",1,2,1))
Which gives this (note that all columns are stored as factors because of the extra info):
> head(df)
A B C
1 ordinal nominal nominal
2 1 A 1
3 2 B 2
4 3 A 1
> str(df)
'data.frame': 4 obs. of 3 variables:
$ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
$ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
$ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1
If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.
> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame': 3 obs. of 3 variables:
$ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
$ B: Factor w/ 2 levels "A","B": 1 2 1
$ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
2 3
3 0.8333333
4 0.3333333 0.8333333
Metric : mixed ; Types = O, N, N
Number of objects : 3
Which gives the same as gower.dist() for this data (although in a slightly different format; as.matrix(daisy(DF)) would be equivalent):
> gower.dist(DF)
[,1] [,2] [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000
You say you can't do it this way? Can you explain why not? You seem to be going to some degree of effort to do something that other people have already coded up for you. This isn't homework, is it?
I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:
Delta=function(i,j){for (i in 1:28) {for (j in 1:47){
if (MyHeader[i,j]=="nominal") {
result=0
# the "{" in the next line before else was sabotaging your efforts
} else if (MyHeader[i,j]=="ordinal") { result=1} }
result}
}
Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.
P.S. The solution that you suggested in my other topic for changing the class of the columns:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
didn't work; it changed only the first nominal and ordinal column.
Thanks again for your help.