Estimate Jaccard similarity mean - igraph

I am working with a network from which I am trying to extract the mean value per vertex of the Jaccard similarity. I am calculating this in R using the igraph package. The similarity index estimates a value between each pair of vertices. The network has 177 vertices, therefore 177 values. It may be easy, but I have not found the best way to do it.

Sum the columns (or rows), subtract 1 (each vertex's similarity with itself), and divide by n - 1, where n is the number of vertices:
library(igraph)
g <- make_ring(5)
m <- similarity(g, method = "jaccard")
(colSums(m)-1)/(nrow(m)-1)
#[1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
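Equivalently, you can blank out the diagonal (each vertex's similarity with itself is 1) and take row means; a small sketch on the same matrix:
diag(m) <- NA                 # drop each vertex's self-similarity
rowMeans(m, na.rm = TRUE)     # mean Jaccard similarity per vertex
#[1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667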

Given an MST for an edge-weighted graph, how can you find the minimally weighted path from x to y?

I have an edge-weighted undirected graph represented by a minimum spanning tree. Each vertex is represented by an integer. The MST looks like this:
edge  weight
6-2   0.40
4-5   0.35
5-7   0.28
2-3   0.17
0-2   0.26
1-7   0.19
0-7   0.16
I wonder, how can I use this MST to find the shortest path from a vertex x to a vertex y? Say I want to find the shortest path from 0 to 3. It's easy to see that the path is 0-2, 2-3 with total weight 0.26 + 0.17 = 0.43. But how should I construct a general way (in pseudocode) of doing this?
In this case, since you are given an MST, you only know that the total weight of the edges kept in the tree is minimal. However, a path between two nodes in an MST is not guaranteed to be the minimal path between those two nodes in the actual graph. In order to find the minimally weighted path from node x to node y, you can perform Dijkstra's algorithm on the original graph (not the MST). Dijkstra's finds the minimum distance from a starting node, in this case x, to every other node in the graph.
Perform Dijkstra's algorithm as follows, storing the information in a table (see the sketch after these steps):
Begin at the starting node, x in this case, and record the distance from x to each of its neighbors.
At each step, visit the unvisited node with the smallest total distance from x recorded so far, and explore its neighbors.
For each neighbor, sum up the total cost so far: if you started at x, then visited a, then c, compute the total distance from x through a to c.
If this new total to a node is lower than what was previously recorded, update the value in the table, because a shorter path has been found.
Ultimately, after performing this algorithm, the table contains the lowest-weight distance from x to every node, including y; tracking each node's predecessor along the way yields the path itself.
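If you just want to see the result concretely, here is a minimal sketch using R's igraph (the package used elsewhere on this page). It builds a graph from the edge list in the question (vertex ids shifted by 1, since igraph's vertices are 1-based) and runs Dijkstra on it; in practice you would run this on the original graph, not on the MST:
library(igraph)
edges <- c(6,2, 4,5, 5,7, 2,3, 0,2, 1,7, 0,7) + 1   # shift to 1-based ids
g <- make_graph(edges, directed = FALSE)
E(g)$weight <- c(0.40, 0.35, 0.28, 0.17, 0.26, 0.19, 0.16)
distances(g, v = 1, to = 4, algorithm = "dijkstra")  # total weight from 0 to 3
# 0.43
shortest_paths(g, from = 1, to = 4)$vpath            # vertices 1 3 4, i.e. 0-2-3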
The MST does not necessarily contain the shortest path from one vertex x to another vertex y. A minimum spanning tree minimizes the total weight of the edges needed to connect all vertices; that does not mean the shortest path from x to y is included in the MST. To find the true shortest path from x to y, you have to run a shortest-path algorithm, like Dijkstra's, on the original graph.

Count only a subset of motifs of size k

I want to count the motifs of size 4 in a tree graph:
library(igraph)
g <- barabasi.game(100)
census.motifs <- motifs(g, size=4)[c(4,8,13,30)]
There are 218 possible directed graphs on 4 vertices, but only 4 of them can appear in a directed rooted tree.
Is there a way to tell igraph that it only has to look for those 4? Or a faster/clever way to do this?
The four motifs possible in a directed rooted tree can be counted as k-instars using the ergm package: http://svitsrv25.epfl.ch/R-doc/library/ergm/html/ergm-terms.html
A k-instar is a set of k nodes all sharing one common root. If n is the number of nodes in your tree, the counts for your 4 motifs will be: the number of 3-instars (the fully connected motif); (n-3) times the number of 2-instars (two edges connecting the root to two nodes, plus one other node); (n-2) choose 2 times the number of 1-instars (one edge connecting the root to one other node, plus two further nodes); and n choose 4 minus the sum of the previous three counts. In R you could use,
library(intergraph)
library(ergm)
library(igraph)
n <- 100
g <- barabasi.game(n)
kistars <- summary(asNetwork(g) ~ istar(1:3))  # counts of 1-, 2-, 3-instars
kistars[3]                       # fully connected motif
(n - 3) * kistars[2]             # 2-instar plus one other node
choose(n - 2, 2) * kistars[1]    # 1-instar plus two other nodes
# remaining motif: n choose 4 minus the sum of the three counts above
choose(n, 4) - (kistars[3] + (n - 3) * kistars[2] + choose(n - 2, 2) * kistars[1])

How to use the subset parameter of igraph's maximal.cliques function?

I have a large graph and I would like to find the maximal clique involving a pair of vertices. I thought that the subset argument to igraph's maximal.clique function would do this, but either I'm using it wrong or it does something completely different. I've spent a fair amount of time searching the web without luck.
Here's a minimal example showing the problem:
> library(igraph)
> packageVersion('igraph')
[1] ‘1.0.1’
> g = graph.empty(n=10, directed=FALSE)
> g = add.edges(g, c(1, 2))
> str(g)
IGRAPH U--- 10 1 --
+ edge:
[1] 1--2
> # This correctly returns a clique.
> maximal.cliques(g, min=2)
[[1]]
+ 2/10 vertices:
[1] 2 1
> # These don't return anything!
> maximal.cliques(g, min=2, subset=1)
list()
> maximal.cliques(g, min=2, subset=c(1, 2))
list()
The subset argument is not for calculating the maximal cliques on a subset of the graph; it simply restricts the set of vertices that are used as starting points in the course of the Bron-Kerbosch algorithm when finding maximal cliques. The Bron-Kerbosch algorithm itself still searches the entire graph and is allowed to add or remove vertices from the current set that it considers as it pleases.
The only role of the subset argument is that it allows you to parallelize the maximal clique computation on large graphs by partitioning the vertex set of the graph into a number of subsets and then running maximal.cliques on multiple CPUs or CPU cores with different subsets. It is not even guaranteed that a maximal clique will be reported when the starting subset includes some or all of its vertices; conversely, on my machine, the maximal clique 1--2 is found if I use a starting subset consisting of vertex 9 only:
> maximal.cliques(g, subset=c(9))
[[1]]
+ 2/10 vertices:
[1] 2 1
If you want to search for maximal cliques in a subgraph of the original graph, use induced_subgraph first, followed by max_cliques.
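For example, continuing the session above, a minimal sketch of that approach (induced_subgraph and max_cliques are the current names of these igraph functions):
> sub <- induced_subgraph(g, c(1, 2))  # subgraph spanned by vertices 1 and 2
> max_cliques(sub, min = 2)            # maximal cliques within that subgraph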

Three-way xor-like function

I'm trying to solve the following puzzle:
Given a stream of numbers (only 1 iteration over them is allowed) in which all numbers appear 3 times, but 1 number appears only 2 times, find this number, using O(1) memory.
I started with the idea that, if all numbers appeared 2 times and 1 number only once, I could XOR all the numbers together and the result would be the unknown number.
So I want to extend this idea to solve the puzzle. All I need is an XOR-like function (or operator) that yields 0 on the third application:
SEED xor3 X xor3 X xor3 X = SEED
X xor3 Y xor3 SEED xor3 X xor3 Y xor3 Y xor3 X = SEED
Any ideas for such a function?
Regard XOR as summation on each bit of a number expressed in binary (i.e. radix 2), modulo 2.
Now consider a numeral system consisting of the trits (base-3 digits) 0, 1, and 2. That is, it has radix 3.
The operator T now becomes an operation on any number decomposed into this radix. As in XOR, you sum the digits, but the difference is that in operator T the sum is taken modulo 3.
You can easily show that a T a T a is zero for any a. You can also show that T is both commutative and associative. That is necessary since, in general, your sequence will have the numbers jumbled up.
Now apply this to your list of numbers. At the end of the operation, the output will be b, where b = o T o and o is the number that occurs exactly twice; halving each base-3 digit of b modulo 3 recovers o.
Your solution for the simpler case (all numbers appear twice, one number appears once) works since XOR operates on each bit x as
x xor x = 0 and 0 xor x = x
XOR is basically a bit-wise summation modulo 2. You need the base-3 equivalent: transform each number into a base-3 representation, and then use summation modulo 3 for each digit:
xor3 | 0 1 2
   0 | 0 1 2
   1 | 1 2 0
   2 | 2 0 1
Call this operation xor3. Now, for each digit x, you have
x xor3 x xor3 x = 0 and 0 xor3 x = x
If you apply that to all your numbers, then all values that appear 3 times vanish. The result is x xor3 x for the number x that appears twice. To recover x, apply digit-wise division by 2 modulo 3; equivalently, double each digit modulo 3, since 2 * 2 = 4 = 1 (mod 3).
I believe there are more efficient ways to implement this. The advantage of XOR in the simpler case relies on the fact that XOR is a natural base-2 operation; base 3 has no such hardware support. Is there any practical application for this?
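A minimal R sketch of this digit-wise approach (the helper names xor3 and halve3 are mine, not from any package):
# digit-wise sum of the base-3 representations of a and b, modulo 3
xor3 <- function(a, b, digits = 20) {
  p <- 3^(0:(digits - 1))                            # place values 1, 3, 9, ...
  sum((((a %/% p) %% 3 + (b %/% p) %% 3) %% 3) * p)
}
# recover x from x xor3 x by halving each digit modulo 3;
# doubling each digit does the job, since 2 * 2 = 4 = 1 (mod 3)
halve3 <- function(a, digits = 20) {
  p <- 3^(0:(digits - 1))
  sum(((2 * ((a %/% p) %% 3)) %% 3) * p)
}
stream <- c(5, 7, 5, 9, 7, 7, 9, 5, 9, 4, 4)         # 4 appears twice
halve3(Reduce(xor3, stream))
#[1] 4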
This approach is a bit fragile: if the precondition (all numbers appear 3 times except one that appears twice) breaks, the algorithm will not help you.
Alternatively, take a map with int keys and int values. Walk through your numbers and, for each number x, increase the corresponding value, starting from 0 when x is a new key.
Then you can analyze it easily: walk through all keys and check the cardinality. It should be three for every key except one, where it should be two. This is more robust (at the cost of O(n) memory rather than the O(1) the puzzle asks for), and my gut feeling says it is also faster.
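In R, that counting approach is a short sketch with table() (again using O(n) memory):
stream <- c(5, 7, 5, 9, 7, 7, 9, 5, 9, 4, 4)   # 4 appears twice
counts <- table(stream)
as.integer(names(counts)[counts == 2])
#[1] 4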

The most efficient way to calculate an integral in a dataset range

I have an array of 10 rows by 20 columns. Each column corresponds to a data set that cannot be fitted with any sort of continuous mathematical function (it's a series of numbers derived experimentally). I would like to calculate the integral of each column between row 4 and row 8, then store the obtained results in a new array (20 rows x 1 column).
I have tried using different scipy.integrate modules (e.g. quad, trapz, ...).
The problem is that, from what I understand, scipy.integrate must be applied to functions, and I am not sure how to convert each column of my initial array into a function. As an alternative, I thought of calculating the average of each column between row 4 and row 8, multiplying this number by 4 (i.e. 8-4=4, the x-interval) and then storing it in my final 20x1 array. The problem is that I don't know how to calculate the average over a given range. The questions I am asking are:
Which method is more efficient/straightforward?
Can integrals be calculated over a data set like the one that I have described?
How do I calculate the average over a range of rows?
Since you know only the data points, the best choice is to use trapz (the trapezoidal approximation to the integral, based on the data points you know).
You most likely don't want to convert your data sets to functions, and with trapz you don't need to.
So if I understand correctly, you want to do something like this:
import numpy as np

# x-coordinates for the data points
x = np.array([0, 0.4, 1.6, 1.9, 2, 4, 5, 9, 10])

# some sample data: 3 data sets sharing the same x-coordinates
y = np.zeros([len(x), 3])
y[:, 0] = 123
y[:, 1] = 1 + x
y[:, 2] = np.cos(x / 5.)
print(y)

# compute approximations of integral(dataset, x=0..10) for data sets i=0,1,2
yi = np.trapz(y, x[:, np.newaxis], axis=0)
# what happens here: x must broadcast against the shape of y;
# np.newaxis adds a "virtual" axis to x, in effect saying that the
# x-coordinates are the same for each data set

# approximations of the integrals based on the data sets
# (here we also know the exact values, so print them too)
print(yi[0], 123 * 10)
print(yi[1], 10 + 10 * 10 / 2.)
print(yi[2], np.sin(10. / 5.) * 5.)
To get the sum of the entries 4 to 8 (including both ends) in each column, use
a = numpy.arange(200).reshape(10, 20)
a[4:9].sum(axis=0)
# and for the average over the same rows:
a[4:9].mean(axis=0)
(The first line is just to create an example array of the desired shape.)