How to use the subset parameter of igraph's maximal.cliques function? - igraph

I have a large graph and I would like to find the maximal clique involving a pair of vertices. I thought that the subset argument to igraph's maximal.cliques function would do this, but either I'm using it wrong or it does something completely different. I've spent a fair amount of time searching the web without luck.
Here's a minimal example showing the problem:
> library(igraph)
> packageVersion('igraph')
[1] ‘1.0.1’
> g = graph.empty(n=10, directed=FALSE)
> g = add.edges(g, c(1, 2))
> str(g)
IGRAPH U--- 10 1 --
+ edge:
[1] 1--2
> # This correctly returns a clique.
> maximal.cliques(g, min=2)
[[1]]
+ 2/10 vertices:
[1] 2 1
> # These don't return anything!
> maximal.cliques(g, min=2, subset=1)
list()
> maximal.cliques(g, min=2, subset=c(1, 2))
list()

The subset argument is not for calculating the maximal cliques on a subset of the graph; it simply restricts the set of vertices that are used as starting points in the course of the Bron-Kerbosch algorithm when finding maximal cliques. The Bron-Kerbosch algorithm itself still searches the entire graph and is allowed to add or remove vertices from the current set that it considers as it pleases.
The only role of the subset argument is that it allows you to parallelize the maximal cliques computation on large graphs by partitioning the vertex set of the graph into a number of subsets and then running maximal.cliques on multiple CPUs or CPU cores with different subsets. It is not guaranteed that a maximal clique will be found if the starting subset includes any or all of its vertices; for instance, on my machine, the maximal clique 1--2 is found if I use a starting subset consisting of vertex 9 only:
> maximal.cliques(g, subset=c(9))
[[1]]
+ 2/10 vertices:
[1] 2 1
If you want to search for maximal cliques in a subgraph of the original graph, use induced_subgraph first, followed by max_cliques.
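For the original goal (the maximal cliques that contain a given pair of vertices), one option is to restrict the search to the pair plus their common neighbours. The following is a minimal pure-Python sketch of that idea, not igraph's implementation; the adjacency-dict representation and the function names are assumptions made for illustration.

```python
def bron_kerbosch(adj, r, p, x, out):
    """Basic Bron-Kerbosch (no pivoting): extend clique r with candidates p."""
    if not p and not x:
        out.append(set(r))  # r cannot be extended, so it is maximal
        return
    for v in list(p):
        bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}   # v has been fully explored
        x = x | {v}   # exclude v from later cliques to avoid duplicates


def maximal_cliques_containing(adj, u, v):
    """All maximal cliques of the graph that contain both u and v."""
    if v not in adj[u]:
        return []  # non-adjacent vertices never share a clique
    common = adj[u] & adj[v]  # only common neighbours can join {u, v}
    out = []
    bron_kerbosch(adj, {u, v}, set(common), set(), out)
    return out


# The graph from the question: 10 vertices, a single edge 1--2.
adj = {i: set() for i in range(1, 11)}
adj[1].add(2)
adj[2].add(1)
print(maximal_cliques_containing(adj, 1, 2))  # [{1, 2}]
```

In igraph terms this roughly corresponds to taking the induced_subgraph on the common neighbours of the two vertices, running max_cliques on it, and adding the pair back to every result.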

Low power and singular fit in mixed model despite high number of observations and low number of factors

Despite making 276 independent observations across 5 sites (lowest number of observations per site: 23), I get the singularity warning and low power when fitting a model with one categorical factor (two levels) and one random factor (site). Could anybody tell me why that would happen?
Here is a reproducible example:
dataset:
https://www.dropbox.com/s/azpnhnemgyo0i72/example.xlsx?dl=0
require(readxl)
require(lme4)
df <- data.frame((read_excel("Example.xlsx", sheet=1)))
m1 = lmer(ds.lac ~ sand.avail + (1|site), data = df)
summary(m1)
simr::powerSim(m1, nsim=100)
# results in:
# boundary (singular) fit: see ?isSingular
# Simulating: |============================
# Power for predictor 'sand.avail', 9.00% (95% confidence interval): ( 4.20, 16.40)
# Test: Kenward Roger (package pbkrtest)
As suggested by @Jose, this really belongs on CrossValidated.
library(ggplot2)
pd <- position_dodge(width = 0.75)
gg1 <- ggplot(df,
              aes(x = site, y = ds.lac, colour = sand.avail)) +
    stat_sum(position = pd, alpha = 0.8)
ggsave("powerSim.png", plot = gg1)
Also as suggested by Jose, there are a number of reasons this data set doesn't have as much information as you would like:
most (75%) of the observations are 0 (mean(df$ds.lac==0) computes the proportion);
you only have all three sand.avail types available at one site, and two of the sites only have one type available (thus there will be a lot of overlap between the random effect of site and the effect of sand.avail);
for modern mixed models, 5 levels of the grouping variable is generally considered a minimum.
Furthermore, I have a concern about what powerSim is doing here; this seems to be a post hoc power analysis (i.e., estimating the power of a data set you have already gathered), which is widely criticized by statisticians. It would be perfectly reasonable to say "why isn't the effect of sand.avail significant for this model, fitted to these data?" (e.g. car::Anova(m1) gives p=0.66), but asking about the power for this data set is conceptually problematic.

How does rstan store posterior samples for separate chains?

I would like to understand how the output of extract in rstan orders the posterior samples. I understand that I can view the posterior samples from each chain by using as.array,
stanfit <- sampling(
    model,
    data = stan.data)

fitarray <- as.array(stanfit)
For example, fitarray[, 2, 1] will give me the samples for the second chain of the first parameter. One way to store the posterior samples in the output of extract would be just to concatenate them. When I do,
fit <- extract(stanfit)
mean(fitarray[,2,1]) == mean(fit$ss[1001:2000])
for several chains and parameters I always get TRUE (ss is the first parameter). This makes it seem like the posterior samples are being concatenated in fit. However, when I do,
fitarray[,2,1] == fit$ss[1001:2000]
I get FALSE (confirmed that there's not just precision difference). It appears that fitarray and fit are storing the iterations differently. How do I view the iterations (in order) of each posterior sample chain separately?
As can be seen from rstan:::as.array.stanfit, the as.array method is essentially defined as
extract(x, permuted = FALSE, inc_warmup = FALSE)
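Your default use of extract (permuted = TRUE) discards the warmup and permutes the post-warmup draws randomly, which is why the indices do not line up with the as.array output. To get the draws of each chain in iteration order, call extract(stanfit, permuted = FALSE) or use the as.array output directly.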
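The effect can be reproduced outside Stan. The snippet below is only an illustrative sketch in plain Python, assuming the draws are shuffled within a chain before the chains are concatenated (which would be consistent with the matching means reported in the question); the variable names are made up for the example.

```python
import random
from statistics import fmean

random.seed(1)
chain = [random.gauss(0.0, 1.0) for _ in range(1000)]  # draws in iteration order
permuted = random.sample(chain, len(chain))            # same draws, shuffled

print(fmean(chain) == fmean(permuted))    # True: same values, different order
print(chain == permuted)                  # False: positions no longer match
print(sorted(chain) == sorted(permuted))  # True: nothing was added or lost
```

Summaries such as the mean are unaffected by the shuffle, but any element-by-element comparison against the ordered draws fails.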

How to find a function that fits a given set of data points in Julia?

So, I have a vector that corresponds to a given feature (same dimensionality). Is there a package in Julia that would provide a mathematical function that fits these data points, in relation to the original feature? In other words, I have x and y (both vectors) and need to find a decent mapping between the two, even if it's a highly complex one. The output of this process should be a symbolic formula that connects x and y, e.g. (:x)^3 + log(:x) - 4.2454. It's fine if it's just a polynomial approximation.
I imagine this is a walk in the park if you employ Genetic Programming, but I'd rather opt for a simpler (and faster) approach, if it's available. Thanks
Turns out the Polynomials.jl package includes the function polyfit which does Lagrange interpolation. A usage example would go:
using Polynomials # install with Pkg.add("Polynomials")
x = [1,2,3] # demo x
y = [10,12,4] # demo y
polyfit(x,y)
The last line returns:
Poly(-2.0 + 17.0x - 5.0x^2)
which evaluates to the correct values.
The polyfit function accepts a maximal degree for the output polynomial, but defaults to the length of the input vectors x and y minus 1. This is the same degree as the polynomial from the Lagrange formula, and since two polynomials of at most that degree that agree on all the inputs must be identical (this is a basic theorem), we can be certain this is the Lagrange polynomial, and in fact the only polynomial of such a degree with this property.
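To make the Lagrange construction concrete, here is a small sketch in Python that expands the Lagrange formula into coefficients using exact fractions. It is an illustration of the formula, not how Polynomials.jl computes its fit; the function name lagrange_coefficients is hypothetical.

```python
from fractions import Fraction


def lagrange_coefficients(xs, ys):
    """Coefficients [c0, c1, ...] (lowest degree first) of the unique
    polynomial of degree < len(xs) passing through all (xs[i], ys[i])."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        # Numerator of the i-th basis polynomial: prod over j != i of (x - xs[j]).
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j in range(n):
            if j == i:
                continue
            shifted = [Fraction(0)] + basis                       # x * basis
            scaled = [-xs[j] * c for c in basis] + [Fraction(0)]  # -xs[j] * basis
            basis = [a + b for a, b in zip(shifted, scaled)]
            denom *= xs[i] - xs[j]
        # Add ys[i] * basis / denom to the running total.
        for k, c in enumerate(basis):
            coeffs[k] += ys[i] * c / denom
    return coeffs


coeffs = lagrange_coefficients([1, 2, 3], [10, 12, 4])
print([float(c) for c in coeffs])  # [-2.0, 17.0, -5.0]
```

The coefficients match the polynomial -2 + 17x - 5x^2 reported above for the demo data.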
Thanks to the developers of Polynomials.jl for leaving me just to google my way to an Answer.
Take a look at MARS regression (multivariate adaptive regression splines).

Generate a powerset with the help of a binary representation

I know that "a powerset is simply any number between 0 and 2^N-1 where N is number of set members and one in binary presentation denotes presence of corresponding member".
(Hynek -Pichi- Vychodil)
I would like to generate a powerset using this mapping from the binary representation to the actual set elements.
How can I do this with Erlang?
I have tried to modify this, but with no success.
UPD: My goal is to write an iterative algorithm that generates a powerset of a set without keeping a stack. I tend to think that binary representation could help me with that.
Here is the successful solution in Ruby, but I need to write it in Erlang.
UPD2: Here is the solution in pseudocode, I would like to make something similar in Erlang.
First of all, I would note that with Erlang a recursive solution does not necessarily imply it will consume extra stack. When a method is tail-recursive (i.e., the last thing it does is the recursive call), the compiler will re-write it into modifying the parameters followed by a jump to the beginning of the method. This is fairly standard for functional languages.
To generate a list of all the numbers A to B, use the library method lists:seq(A, B).
To translate a list of values (such as the list from 0 to 2^N-1) into another list of values (such as the set generated from its binary representation), use lists:map or a list comprehension.
Instead of splitting a number into its binary representation, you might want to consider turning that around and checking whether the corresponding bit is set in each M value (in 0 to 2^N-1) by generating a list of power-of-2-bitmasks. Then, you can do a binary AND to see if the bit is set.
Putting all of that together, you get a solution such as:
generate_powerset(List) ->
    % Do some pre-processing of the list to help with checks later.
    % This involves modifying the list to combine the element with
    % the bitmask it will need later on, such as:
    % [a, b, c, d, e] ==> [{1,a}, {2,b}, {4,c}, {8,d}, {16,e}]
    PowersOf2 = [1 bsl (X-1) || X <- lists:seq(1, length(List))],
    ListWithMasks = lists:zip(PowersOf2, List),
    % Generate the list from 0 to 2^N - 1
    AllMs = lists:seq(0, (1 bsl length(List)) - 1),
    % For each value, generate the corresponding subset
    lists:map(fun (M) -> generate_subset(M, ListWithMasks) end, AllMs).
    % or, using a list comprehension:
    % [generate_subset(M, ListWithMasks) || M <- AllMs].

generate_subset(M, ListWithMasks) ->
    % List comprehension: choose each element where the Mask value has
    % the corresponding bit set in M.
    [Element || {Mask, Element} <- ListWithMasks, M band Mask =/= 0].
However, you can also achieve the same thing using tail recursion without consuming stack space. It also doesn't need to generate or keep around the list from 0 to 2^N-1.
generate_powerset(List) ->
    % same preliminary steps as above...
    PowersOf2 = [1 bsl (X-1) || X <- lists:seq(1, length(List))],
    ListWithMasks = lists:zip(PowersOf2, List),
    % call tail-recursive helper method -- it can have the same name
    % as long as it has different arity.
    generate_powerset(ListWithMasks, (1 bsl length(List)) - 1, []).

generate_powerset(_ListWithMasks, -1, Acc) -> Acc;
generate_powerset(ListWithMasks, M, Acc) ->
    generate_powerset(ListWithMasks, M-1,
                      [generate_subset(M, ListWithMasks) | Acc]).

% same as above...
generate_subset(M, ListWithMasks) ->
    [Element || {Mask, Element} <- ListWithMasks, M band Mask =/= 0].
Note that when generating the list of subsets, you'll want to put new elements at the head of the list. Lists are singly-linked and immutable, so if you want to put an element anywhere but the beginning, it has to update the "next" pointers, which causes the list to be copied. That's why the helper function puts the Acc list at the tail instead of doing Acc ++ [generate_subset(...)]. In this case, since we're counting down instead of up, we're already going backwards, so it ends up coming out in the same order.
So, in conclusion,
Looping in Erlang is idiomatically done via a tail recursive function or using a variation of lists:map.
In many (most?) functional languages, including Erlang, tail recursion does not consume extra stack space since it is implemented using jumps.
List construction is typically done backwards (i.e., [NewElement | ExistingList]) for efficiency reasons.
You generally don't want to find the Nth item in a list (using lists:nth) since lists are singly-linked: it would have to iterate the list over and over again. Instead, find a way to iterate the list once, such as how I pre-processed the bit masks above.
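For readers who want to see the bit-to-element mapping on its own, here is the same idea as a short Python sketch (Python is used only for illustration; the function name powerset is an assumption). It is a generator, so it walks the numbers 0 to 2^N-1 iteratively and produces one subset at a time without building the whole list first.

```python
def powerset(items):
    """Yield every subset of items, one per number m in 0 .. 2^N - 1.

    Bit i of m decides whether items[i] is part of the subset.
    """
    items = list(items)
    n = len(items)
    for m in range(1 << n):
        yield [items[i] for i in range(n) if m & (1 << i)]


for subset in powerset(["a", "b", "c"]):
    print(subset)
# []
# ['a']
# ['b']
# ['a', 'b']
# ['c']
# ['a', 'c']
# ['b', 'c']
# ['a', 'b', 'c']
```

The subsets come out in the same order as in the Erlang versions above, since element a is paired with the lowest bit.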

The most efficient way to calculate an integral in a dataset range

I have an array of 10 rows by 20 columns. Each column corresponds to a data set that cannot be fitted with any sort of continuous mathematical function (it's a series of numbers derived experimentally). I would like to calculate the integral of each column between row 4 and row 8, then store the obtained result in a new array (20 rows x 1 column).
I have tried using different scipy.integrate modules (e.g. quad, trapz,...).
The problem is that, from what I understand, scipy.integrate must be applied to functions, and I am not sure how to convert each column of my initial array into a function. As an alternative, I thought of calculating the average of each column between row 4 and row 8, then multiplying this number by 4 (i.e. 8-4=4, the x-interval) and storing the result in my final 20x1 array. The problem is...ehm...that I don't know how to calculate the average over a given range. The questions I am asking are:
Which method is more efficient/straightforward?
Can integrals be calculated over a data set like the one that I have described?
How do I calculate the average over a range of rows?
Since you know only the data points, the best choice is to use trapz (the trapezoidal approximation to the integral, based on the data points you know).
You most likely don't want to convert your data sets to functions, and with trapz you don't need to.
So if I understand correctly, you want to do something like this:
import numpy as np

# x-coordinates for data points
x = np.array([0, 0.4, 1.6, 1.9, 2, 4, 5, 9, 10])
# some example data: 3 arbitrary data sets (sharing the same x-coordinates)
y = np.zeros([len(x), 3])
y[:, 0] = 123
y[:, 1] = 1 + x
y[:, 2] = np.cos(x / 5.)
print(y)
# compute approximations for integral(dataset, x=0..10) for datasets i=0,1,2
# (NumPy 2.0 and later also provide this function under the name np.trapezoid)
yi = np.trapz(y, x[:, np.newaxis], axis=0)
# what happens here: x is given a shape that broadcasts against y;
# np.newaxis adds a new "virtual" axis to x, in effect saying that the
# x-coordinates are the same for each data set
# approximations of the integrals based on the data sets
# (here we also know the exact values, so print them too)
print(yi[0], 123 * 10)
print(yi[1], 10 + 10 * 10 / 2.)
print(yi[2], np.sin(10. / 5.) * 5.)
To get the sum of the entries 4 to 8 (including both ends) in each column, use
import numpy
a = numpy.arange(200).reshape(10, 20)
a[4:9].sum(axis=0)
(The arange line is just to create an example array of the desired shape.)
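To connect this back to the layout described in the question (10 rows by 20 columns, integrating each column from row 4 to row 8), here is a pure-Python sketch of the trapezoidal rule applied column by column. It assumes unit spacing between rows and 0-based row indices; the helper names trapezoid and integrate_columns are made up for the example.

```python
def trapezoid(ys, dx=1.0):
    """Trapezoidal approximation of the integral of equally spaced samples."""
    total = 0.0
    for i in range(1, len(ys)):
        total += dx * (ys[i - 1] + ys[i]) / 2.0
    return total


def integrate_columns(table, first_row, last_row):
    """Integrate every column of table over rows first_row..last_row inclusive."""
    rows = table[first_row:last_row + 1]
    n_cols = len(rows[0])
    return [trapezoid([row[c] for row in rows]) for c in range(n_cols)]


# Example data with the same values as numpy.arange(200).reshape(10, 20).
table = [[r * 20 + c for c in range(20)] for r in range(10)]
result = integrate_columns(table, 4, 8)
print(len(result), result[0], result[19])  # 20 480.0 556.0
```

For this linear example the result happens to equal the column mean over rows 4 to 8 multiplied by 4. For data that is not linear in the row index the two generally differ, because the trapezoidal rule gives the two end rows only half weight.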