Jaccard distance between tweets - json

I'm currently trying to measure the Jaccard distance between tweets in a dataset. The dataset is here:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance; this is what I have so far. First, I saved the linked dataset to a file called Tweets.json and read it in:
library(jsonlite)  # assuming fromJSON comes from jsonlite here
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines("Tweets.json"), collapse = ",")))
Then I converted json_alldata to tweet.features and got rid of the geo column
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
These are what the first two tweets look like:
> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist function from the stringdist package:
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23 and A union B = 25, and Jaccard is the intersection of A and B divided by the union of A and B -- right? So by my calculation the result should be 23/25 = 0.92?
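A note on that result, based on the stringdist documentation: method = "jaccard" there compares sets of character q-grams (q = 1 by default), not words, and it returns a distance, i.e. 1 minus the Jaccard index. That is why it does not match a word-level calculation. The call above is equivalent to this explicit form:
# same call with the q-gram size spelled out; with q = 1 each string is
# treated as a set of single characters
stringdist(tweet.features$text[1], tweet.features$text[2],
           method = "jaccard", q = 1)
# [1] 0.1621622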
So I figured I could do it with sets: simply calculate the intersection and the union and divide. This is what I tried:
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try the intersection, the output is just an empty list:
Intersection <- intersect(A1, A2)
list()
When I try Union, I get this:
union(A1, A2)
[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union. But I guess I would need the program to count the number of words in each set first, then do the calculations.
Needless to say, I'm a bit stuck and I'm not sure if I'm on the right track.
Any help would be appreciated. Thank you.

intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit; exactly how you split is up to you. An example below:
tweet.features <- list(
  tweet1 = "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
  tweet2 = "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")

jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i = i, u = u, j = i/u)
}
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want? Note that j = i/u is the Jaccard index (similarity); the Jaccard distance, strictly speaking, is 1 - i/u.
The strsplit is here done for every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
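One possible refinement (an assumption on my part, not from the thread): lowercase the text and split on runs of non-word characters, so punctuation and case differences are ignored.
# word-level Jaccard with a more forgiving tokenizer (hypothetical helper)
jaccard_w <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tolower(tw1), "\\W+"))
  tw2 <- unlist(strsplit(tolower(tw2), "\\W+"))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i = i, u = u, j = i/u)
}
jaccard_w(tweet.features[[1]], tweet.features[[2]])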

Related

Low power and singular fit in mixed model despite high number of observations and low number of factors

Although I made 276 independent observations across 5 sites (lowest number of observations per site: 23), I get the singularity warning and low power when fitting a model with one categorical factor (two levels) and one random factor (site). Could anybody tell me why that happens?
Here is a reproducible example:
dataset:
https://www.dropbox.com/s/azpnhnemgyo0i72/example.xlsx?dl=0
require(readxl)
require(lme4)
df <- data.frame(read_excel("Example.xlsx", sheet = 1))
m1 = lmer(ds.lac ~ sand.avail + (1|site), data = df)
summary(m1)
simr::powerSim(m1, nsim=100)
# results in:
# boundary (singular) fit: see ?isSingular
# (progress bar output from powerSim omitted)
# Power for predictor 'sand.avail', 9.00% (95% confidence interval): (4.20, 16.40)
# Test: Kenward Roger (package pbkrtest)
As suggested by Jose, this really belongs on CrossValidated.
library(ggplot2)
pd <- position_dodge(width = 0.75)
gg1 <- ggplot(df,
              aes(x = site, y = ds.lac, colour = sand.avail)) +
  stat_sum(position = pd, alpha = 0.8)
ggsave("powerSim.png", gg1)  # pass the plot explicitly so gg1, not the last plot shown, is saved
Also as suggested by Jose, there are a number of reasons this data set doesn't have as much information as you would like (the first two points are checked in the sketch below):
1. Most (75%) of the observations are 0 (mean(df$ds.lac == 0) counts the proportion).
2. Only one site has all three sand.avail types available; two of the sites have only one type, so there will be a lot of overlap between the random effect of site and the effect of sand.avail.
3. For modern mixed models, 5 levels of the grouping variable is generally considered a minimum.
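A quick way to verify the first two points (a minimal sketch, assuming the df read in above):
mean(df$ds.lac == 0)               # proportion of exact zeros
with(df, table(site, sand.avail))  # which sand.avail types occur at each site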
Furthermore, I have a concern about what powerSim is doing here; this seems to be a post hoc power analysis (i.e., estimating the power of a data set you have already gathered), which is widely criticized by statisticians. It would be perfectly reasonable to say "why isn't the effect of sand.avail significant for this model, fitted to these data?" (e.g. car::Anova(m1) gives p=0.66), but asking about the power for this data set is conceptually problematic.

Generate small-world model in igraph using Newman-Watts algorithm

I want to generate small-world networks using igraph, but without the "rewiring" implemented in watts.strogatz.game(). In the Newman-Watts variation all local links are kept fixed and random "long-range" links are added on top of the network at a fixed rate. I thought I could simply generate a lattice (e.g. g <- graph.lattice(length=20, dim=1, circular=TRUE)) and then put a classical random graph on top of that. However, I do not know how to do this using an existing graph as the input. Or maybe it is possible to add random edges with a specified probability?
Any help highly appreciated.
Thanks a lot!
Use graph.lattice to generate a lattice, then erdos.renyi.game with the same number of vertices and a fixed probability to generate a random graph. Then you can combine the two graphs using the %u% (union) operator. There is a small chance for multi-edges if the same edge happens to be the part of the lattice and the random graph as well, so you should also call simplify() on the union if you don't want that.
This seems to do the trick, in case anyone is interested. I just need to wrap it in a function so the procedure can be repeated (see the sketch after the code). Many thanks again, Tamas!
library(igraph)
g <- graph.lattice(length=100, dim=1, circular=TRUE)
g2 <- erdos.renyi.game(100, 1/100)
g3 <- g %u% g2
g3 <- simplify(g3)
plot.igraph(g3, vertex.size = 1,vertex.label = NA, layout=layout_in_circle)
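A small wrapper along those lines (my own sketch, not from the thread; the name newman_watts is made up):
# lattice plus random shortcut edges, deduplicated: a Newman-Watts-style graph
newman_watts <- function(n = 100, p = 1/100) {
  lattice   <- graph.lattice(length = n, dim = 1, circular = TRUE)
  shortcuts <- erdos.renyi.game(n, p)
  simplify(lattice %u% shortcuts)
}
g4 <- newman_watts(200, 0.02)
plot.igraph(g4, vertex.size = 1, vertex.label = NA, layout = layout_in_circle)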

How to use the subset parameter of igraph's maximal.cliques function?

I have a large graph and I would like to find the maximal cliques involving a pair of vertices. I thought that the subset argument to igraph's maximal.cliques function would do this, but either I'm using it wrong or it does something completely different. I've spent a fair amount of time searching the web without luck.
Here's a minimal example showing the problem:
> library(igraph)
> packageVersion('igraph')
[1] ‘1.0.1’
> g = graph.empty(n=10, directed=FALSE)
> g = add.edges(g, c(1, 2))
> str(g)
IGRAPH U--- 10 1 --
+ edge:
[1] 1--2
> # This correctly returns a clique.
> maximal.cliques(g, min=2)
[[1]]
+ 2/10 vertices:
[1] 2 1
> # These don't return anything!
> maximal.cliques(g, min=2, subset=1)
list()
> maximal.cliques(g, min=2, subset=c(1, 2))
list()
The subset argument is not for calculating the maximal cliques on a subset of the graph; it simply restricts the set of vertices that are used as starting points in the course of the Bron-Kerbosch algorithm when finding maximal cliques. The Bron-Kerbosch algorithm itself still searches the entire graph and is allowed to add or remove vertices from the current set that it considers as it pleases.
The only role of the subset argument is that it allows you to parallelize the maximal cliques computation on large graphs by partitioning the vertex set of the graph into a number of subsets and then running maximal.cliques on multiple CPUs or CPU cores with different subsets. It is not guaranteed that a maximal clique will be found if the starting subset includes any or all of its vertices; for instance, on my machine, the maximal clique 1--2 is found if I use a starting subset consisting of vertex 9 only:
> maximal.cliques(g, subset=c(9))
[[1]]
+ 2/10 vertices:
[1] 2 1
If you want to search for maximal cliques in a subgraph of the original graph, use induced_subgraph first, followed by max_cliques.
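For instance (a minimal sketch; the vertex set c(1, 2, 9) is just an illustration):
# cliques within the subgraph induced by a chosen vertex set; note that
# vertices are renumbered inside the subgraph
sub <- induced_subgraph(g, vids = c(1, 2, 9))
max_cliques(sub, min = 2)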

Problems with results when using betweenness with igraph

I am currently having problems with the results when using betweenness in igraph.
I have created the following network (which is a star network with node 1 in the center):
id <- c(1,1,1,1,1)
rv <- c(2,3,4,5,6)
df <- as.data.frame(cbind(id,rv))
Then I calculate the betweenness for each node and add it to a dataframe:
g3=graph.data.frame(df, directed=TRUE)
bFrame<-as.data.frame(as.table(betweenness(g3)))
The problem is: if you use directed=FALSE, node 1 has a betweenness of 10, which makes sense. If, on the other hand, you use directed=TRUE, node 1 has a betweenness of 0.
In consequence I have two questions:
1. Why is the centrality in the directed case 0?
2. Shouldn't it be twice the value of the undirected case? https://en.wikipedia.org/wiki/Betweenness_centrality
Thanks in advance
Pavel
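A quick demonstration of what is going on (my own sketch; no answer is recorded in this thread): in the directed version every edge points from node 1 outward, so there is no directed path between any two leaf nodes at all, and hence no shortest path passes through node 1.
g_dir   <- graph.data.frame(df, directed = TRUE)
g_undir <- graph.data.frame(df, directed = FALSE)
betweenness(g_dir)["1"]    # 0: no directed leaf-to-leaf paths exist
betweenness(g_undir)["1"]  # 10: all choose(5, 2) leaf pairs route through node 1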

igraph - How to calculate closeness in igraph for disconnected graphs

I use igraph in R to calculate graph measures. My graph is built from a PIN and is not connected. The closeness method gives a correct calculation for a connected graph, but I am not sure about the result for a disconnected graph.
library(igraph)
# create a connected graph to test closeness centrality
g <- read.table(text="A B
1 2
2 4
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL)  # closeness of each vertex
Output:
[1] 0.1000000 0.1428571 0.1428571 0.1666667 0.1000000
My disconnected graph:
library(igraph)
# create a disconnected graph to test closeness centrality
g <- read.table(text="A B
1 2
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL)  # closeness of each vertex
Output:
[1] 0.06250000 0.06250000 0.08333333 0.07692308 0.07692308
Is this output right? If so, how is it calculated?
Please read the documentation of the closeness function; it clearly states how igraph treats disconnected graphs:
If there is no (directed) path between vertex v and i then the total number of vertices is used in the formula instead of the path length.
The calculation then seems correct to me, although I would say that closeness centrality itself is not well-defined for disconnected graphs, and what igraph uses here is more of a hack (although a pretty standard one) than a rigorous treatment of the problem. I would refrain from using closeness centrality on disconnected graphs.
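A worked check of the first value (vertex 1) under that rule: vertex 2 is at distance 1, while vertices 3, 4 and 5 are unreachable, so each of them contributes the total vertex count (5) instead of a path length.
1 / (1 + 5 + 5 + 5)  # 0.0625, matching the first closeness value above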