How to calculate the closeness measure in igraph for disconnected graphs

I use igraph in R to calculate graph measures. My graph comes from a PIN that is not a connected graph; it is disconnected.
The closeness method calculates correctly for the connected graph, but for the disconnected graph the result does not look right.
library(igraph)
# Build a small test graph for closeness centrality
g <- read.table(text="A B
1 2
2 4
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL)  # closeness of each vertex
Output:
[1] 0.1000000 0.1428571 0.1428571 0.1666667 0.1000000
My disconnected graph:
library(igraph)
# Build a small test graph for closeness centrality
g <- read.table(text="A B
1 2
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL)  # closeness of each vertex
Output:
[1] 0.06250000 0.06250000 0.08333333 0.07692308 0.07692308
Is this output right? If so, how is it calculated?

Please read the documentation of the closeness function; it clearly states how igraph treats disconnected graphs:
If there is no (directed) path between vertex v and i then the total number of vertices is used in the formula instead of the path length.
The calculation then seems correct to me, although I would say that closeness centrality itself is not well-defined for disconnected graphs; what igraph uses here is more of a hack (albeit a pretty standard one) than a rigorous treatment of the problem. I would refrain from using closeness centrality on disconnected graphs.
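You can verify the numbers by hand. The disconnected graph has n = 5 vertices and edges 1--2, 3--4, 3--5. For vertex 1, the only reachable vertex is 2 at distance 1, and each of the three unreachable vertices contributes the vertex count 5, so closeness(1) = 1/(1 + 5 + 5 + 5) = 1/16 = 0.0625. For vertex 3, vertices 4 and 5 are at distance 1 and the two unreachable vertices contribute 5 each, so closeness(3) = 1/(1 + 1 + 5 + 5) = 1/12 ≈ 0.0833. Both match the output above. (In the connected graph no substitution is needed; e.g. vertex 1 has distances 1, 2, 3, 4 to the other vertices, giving 1/10 = 0.1.)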

Related

Tune input features using backprop in keras

I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is to encode each condition as an input parameter and let the network learn the dependency between the condition and the feature-label mapping. On a new dataset, instead of adapting the entire network, you just tune these weights using backprop. For example, say my network looks like this:
X ---->|----|
       |DNN |----> Y
Z ---->|----|
X: features, Y: labels, Z: condition codes
Now, given a pretrained DNN and X', Y' on a new dataset, I am trying to estimate a Z' via backprop that minimizes the prediction error on Y'. The math seems straightforward, except that I am not sure how to implement this in Keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True while all other layers are set to trainable=False? Can backprop in Keras update more than just layer weights? Or is there a way to hack Keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here.
Using the Keras backend I was able to compute the gradient of my loss w.r.t. Z directly and use it to drive the update.
Code below:
import keras.backend as K
import numpy as np

model.summary()  # pretrained model with inputs X, Z and output Y_out
loss = K.categorical_crossentropy(Y, Y_out)
grads = K.gradients(loss, Z)[0]  # K.gradients returns a list; take the tensor
grads /= K.sqrt(K.mean(K.square(grads))) + 1e-5  # normalize the gradient
iterate = K.function([X, Z], [loss, grads])

step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
    loss_val, grads_val = iterate([X_in, Z_adapt])
    Z_adapt -= grads_val * step  # gradient-descent update of Z only
    print("iter:", i, np.mean(loss_val))

print("Before:")
print(model.evaluate([X_in, Z_in], Y_out))
print("After:")
print(model.evaluate([X_in, Z_adapt], Y_out))
X, Y, Z are nodes in the model graph. Z_in is an initial value for Z'; I set it to an average value from the train set. Z_adapt is the result after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. You can first define an input layer of size m * n x 1, whose input will always be an m * n x 1 vector of ones. Then define a dense layer containing m * n neurons and set trainable = True for it. The response of this layer gives you a flattened version of Z; reshape it appropriately and feed it into the rest of the network, which can be appended after it, as sketched below.
Keep in mind that if the size of Z is too large, the network may not be able to learn a dense layer with that many neurons. In that case you may need to put additional constraints on it or look into convolutional layers, although those impose constraints of their own on Z.
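A minimal sketch of this trick, assuming a hypothetical pretrained base_model; all names and sizes here are illustrative, not from the original post:
import numpy as np
from keras.layers import Input, Dense, Flatten, Reshape
from keras.models import Model

m, n = 4, 8  # assumed size of Z

# Stand-in for the pretrained network; in practice use your own frozen DNN.
base_in = Input(shape=(m, n))
base_out = Dense(10, activation="softmax")(Flatten()(base_in))
base_model = Model(base_in, base_out)
for layer in base_model.layers:
    layer.trainable = False  # freeze the pretrained weights

# The trick: a trainable Dense layer whose output plays the role of Z.
ones_in = Input(shape=(m * n,))                 # always fed a vector of ones
z_flat = Dense(m * n, use_bias=False)(ones_in)  # trainable; output is Z, flattened
z = Reshape((m, n))(z_flat)

adapt_model = Model(ones_in, base_model(z))
adapt_model.compile(optimizer="sgd", loss="categorical_crossentropy")
# adapt_model.fit(np.ones((num_samples, m * n)), y_targets)  # only Z is learned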

How to use the subset parameter of igraph's maximal.cliques function?

I have a large graph and I would like to find the maximal clique involving a pair of vertices. I thought that the subset argument of igraph's maximal.cliques function would do this, but either I'm using it wrong or it does something completely different. I've spent a fair amount of time searching the web without luck.
Here's a minimal example showing the problem:
> library(igraph)
> packageVersion('igraph')
[1] ‘1.0.1’
> g = graph.empty(n=10, directed=FALSE)
> g = add.edges(g, c(1, 2))
> str(g)
IGRAPH U--- 10 1 --
+ edge:
[1] 1--2
> # This correctly returns a clique.
> maximal.cliques(g, min=2)
[[1]]
+ 2/10 vertices:
[1] 2 1
> # These don't return anything!
> maximal.cliques(g, min=2, subset=1)
list()
> maximal.cliques(g, min=2, subset=c(1, 2))
list()
The subset argument is not for calculating the maximal cliques on a subset of the graph; it simply restricts the set of vertices that are used as starting points in the course of the Bron-Kerbosch algorithm when finding maximal cliques. The Bron-Kerbosch algorithm itself still searches the entire graph and is allowed to add or remove vertices from the current set that it considers as it pleases.
The only role of the subset argument is that it allows you to parallelize the maximal clique computation on large graphs by partitioning the vertex set into a number of subsets and then running maximal.cliques on multiple CPUs or CPU cores with different subsets. A maximal clique is not guaranteed to be returned for a subset that includes some or even all of its vertices, and conversely it may be found from a subset containing none of them; for instance, on my machine the maximal clique 1--2 is found with a starting subset consisting of vertex 9 only:
> maximal.cliques(g, subset=c(9))
[[1]]
+ 2/10 vertices:
[1] 2 1
If you want to search for maximal cliques in a subgraph of the original graph, use induced_subgraph first, followed by max_cliques.
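For the example above, max_cliques(induced_subgraph(g, c(1, 2)), min = 2) should then return the single clique 1--2 (keep in mind that vertex IDs are renumbered in the induced subgraph).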

Jaccard distance between tweets

I'm currently trying to measure the Jaccard distance between tweets in a dataset.
The dataset is here:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
I've tried a few things to measure the distance; this is what I have so far.
I saved the linked dataset to a file called Tweets.json:
library(jsonlite)  # assumed source of fromJSON here
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")), collapse=",")))
Then I converted json_alldata to tweet.features and got rid of the geo column:
# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL
This is what the first two tweets look like:
> tweet.features$text[1]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
The first thing I tried was the stringdist function from the stringdist package:
install.packages("stringdist")
library(stringdist)
#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")
When I run that, I get
[1] 0.1621622
I'm not sure that's correct, though. A intersection B = 23 and A union B = 25. The Jaccard distance is (A intersection B) / (A union B) -- right? So by my calculation it should be 0.92?
So I figured I could do it with sets: simply calculate the intersection and the union and divide. This is what I tried:
# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])
When I try the intersection, the output is just list():
Intersection <- intersect(A1, A2)
list()
When I try the union, I get this:
union(A1, A2)
[[1]]
[1] "RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
[[2]]
[1] "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"
This doesn't seem to be grouping the words into a single set.
I figured I'd be able to divide the intersection by the union, but I guess I need the program to count the number of words in each set and then do the calculation.
Needless to say, I'm a bit stuck, and I'm not sure I'm on the right track. Any help would be appreciated. Thank you.
intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit; how you split is up to you. An example below:
tweet.features <- list(tweet1="RT #ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
tweet2= "RT #NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")
jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}
jaccard_i(tweet.features[[1]], tweet.features[[2]])
$i
[1] 20
$u
[1] 23
$j
[1] 0.8695652
Is this what you want?
Here strsplit splits on every space or dot. You may want to refine the split argument and replace " |\\." with something more specific (see ?regex).
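Two caveats: stringdist's method = "jaccard" computes the Jaccard measure over character q-grams (q = 1 by default) rather than over words, which is why it disagrees with the word-based figure; and strictly speaking the Jaccard distance is 1 minus the similarity j computed above.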

Problems with results when using betweenness with igraph

I am currently having problems with the results when using betweenness in igraph.
I have created the following network (a star network with node 1 at the center):
id <- c(1,1,1,1,1)
rv <- c(2,3,4,5,6)
df <- as.data.frame(cbind(id,rv))
Then I calculate the betweenness for each node and add it to a dataframe:
g3=graph.data.frame(df, directed=TRUE)
bFrame<-as.data.frame(as.table(betweenness(g3)))
The problem is: if you use directed=FALSE, you get that node 1 has a centrality of 10, which makes sense. If, on the other hand, you use directed=TRUE, node 1 has a centrality of 0.
Consequently I have two questions:
1. Why is the centrality 0 in the directed case?
2. Shouldn't it be twice the value of the undirected case? https://en.wikipedia.org/wiki/Betweenness_centrality
Thanks in advance
Pavel
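With directed=TRUE, betweenness counts only directed shortest paths. Here every edge points from node 1 to a leaf, so there is no directed path between any pair of leaves at all, and hence no shortest path passes through node 1; its betweenness is therefore 0. That is also why the directed value is not simply twice the undirected one: ordered pairs with no connecting path contribute nothing to the sum.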

Multiple regression with lagged time series using libsvm

I'm trying to develop a forecaster for electric consumption, so I want to perform a regression using daily data for an entire year. My dataset has several features. From searching, I've found that my problem is a multiple regression problem (please correct me if I'm mistaken).
What I want to do is train an SVM for regression with several independent variables and one dependent variable with n lagged days. Here's a sample of my independent variables; I actually have around 10. (We used PCA to determine which variables had some correlation to our problem.)
Day Indep1 Indep2 Indep3
1 1.53 2.33 3.81
2 1.71 2.36 3.76
3 1.83 2.81 3.64
... ... ... ...
363 1.5 2.65 3.25
364 1.46 2.46 3.27
365 1.61 2.72 3.13
Independent variable 1 is actually my dependent variable in the future. So, for example, with p=2 (lagged days) I would expect my SVM to train on the first two time steps of all three independent variables:
Indep1 Indep2 Indep3
1.53 2.33 3.81
1.71 2.36 3.76
And the output value of the dependent variable would be "1.83" (independent variable 1 at time 3).
My main problem is that I don't know how to train properly. What I was doing is putting all p lagged feature vectors into one array for my "x" values, and using independent variable 1 at time p+1 as my "y" value, since I want to predict the next day's power consumption.
Example of training.
x (p = 2, three independent variables)    y (next day)
[1.53, 2.33, 3.81, 1.71, 2.36, 3.76]      [1.83]
I tried making x a two-dimensional array, but when you stack several days it becomes a 3D array, which libsvm says it can't accept.
Perhaps I should switch from libsvm to another tool, or maybe I'm just training incorrectly.
Thanks for your help,
Aldo.
Let me answer in Python/numpy notation.
Assume the original time-series data matrix with columns (Indep1, Indep2, Indep3, ...) is a numpy array data with shape (n_samples, n_variables). Let's generate it randomly for this example:
>>> import numpy as np
>>> n_samples, n_variables = 100, 5
>>> data = np.random.randn(n_samples, n_variables)
>>> data.shape
(100, 5)
If you want to use a window size of 2 time-steps, then the training set can be built as follows:
>>> targets = data[2:, 0] # shape is (n_samples - 2,)
>>> targets.shape
(98,)
>>> features = np.hstack([data[0:-2, :], data[1:-1, :]]) # shape is (n_samples - 2, n_variables * 2)
>>> features.shape
(98, 10)
Now you have a 2D feature array plus a 1D target array that you can feed to libsvm or scikit-learn.
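For instance, a minimal scikit-learn sketch (the SVR settings here are illustrative, not tuned):
>>> from sklearn.svm import SVR
>>> svr = SVR(kernel="rbf", C=1.0)
>>> svr.fit(features, targets)
>>> svr.predict(features[-1:])  # forecast from the most recent window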
Edit: it might very well be the case that extracting more time-series oriented features, such as moving averages, moving min, moving max, moving differences (time-based derivatives of the signal), or an STFT, might help your SVM model make better predictions.
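As a hedged sketch of such feature engineering with numpy (window size illustrative):
>>> window = 3
>>> kernel = np.ones(window) / window
>>> moving_avg = np.convolve(data[:, 0], kernel, mode="valid")  # shape (n_samples - window + 1,)
>>> diffs = np.diff(data[:, 0])  # first differences, shape (n_samples - 1,)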