How to find centrality of nodes within clusters using igraph and Python

I'm working on network analysis and I'm new to Python. I want to find the centrality of every node within a cluster using igraph and pandas.
I have tried the following:
Creating a graph:
tuples = [tuple(x) for x in data.values]
g = igraph.Graph.TupleList(tuples, directed=False, weights=True)
Community detection using the fastgreedy algorithm:
fm = g.community_fastgreedy()
fm1 = fm.as_clustering()
Clusters like this are formed:
[1549] 96650006, 966543799, 966500080
[1401] 96650006, 966567865, 966500069, 966500071
Now, I would like to get the eigenvector centrality for each number within a cluster, so that I know which is the most important number within a cluster.

I am not very familiar with eigenvector centrality in igraph, but here is the solution I came up with:
# initial code is the same as yours
import numpy as np

# iterate over all subgraphs created from the clustering
for subgraph in fm1.subgraphs():
    # this is basically already what you want: the centrality of
    # every node within this cluster
    cents = subgraph.eigenvector_centrality()
    # additionally get the index of the maximum value
    max_idx = np.argmax(cents)
    print(subgraph.vs[max_idx])  # the most central vertex of the cluster
Essentially, you want to use the option to access each created cluster as a graph (.subgraphs() allows you to do exactly that). The rest is then "just" simple manipulation of the graph object to get the element with the maximum eigenvector centrality.
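If you want to see every node's score rather than just the maximum, here is a minimal sketch (assuming the original numbers live in the vertices' "name" attribute, which Graph.TupleList sets by default):
# print every vertex of every cluster together with its eigenvector centrality;
# "name" holds the original numbers because Graph.TupleList stores them there
for cluster_id, subgraph in enumerate(fm1.subgraphs()):
    for vertex, cent in zip(subgraph.vs, subgraph.eigenvector_centrality()):
        print(cluster_id, vertex["name"], cent)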

Related

Seurat cross-species integration

I am currently working with single-cell data from human and zebrafish, both from brain tissue.
My assignment is to integrate them. The steps I have followed so far:
Found human orthologs for zebrafish genes in BioMart
Kept only the one2one orthologs
Subset the zebrafish Seurat object based on the orthologs and replaced the names with the human gene names
Created a new object for zebrafish and ran Normalization and FindVariableFeatures
Then used this object with my human object for integration
Human object: 20620 features across 2989 samples
Zebrafish object: 6721 features across 6036 samples
features <- SelectIntegrationFeatures(object.list = double.list)
anchors <- FindIntegrationAnchors(object.list = double.list,
                                  anchor.features = features,
                                  normalization.method = "LogNormalize",
                                  nn.method = "rann")
This identifies 2085 anchors!
I used nn.method = "rann" because with the default I get this error:
Error: C stack usage 7973252 is too close to the limit
Then I run the integration like this:
ZF_HUMAN.combined <- IntegrateData(anchorset = anchors,
                                   new.assay.name = "integrated")
and the error I am receiving is like this
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 9265 anchors
Filtering anchors
Retained 2085 anchors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=22s
To solve this I tried to play around with the arguments of FindIntegrationAnchors, e.g. I used l2.norm = FALSE. The only thing that changed was the number of anchors, which decreased.
I am wondering if the usage of nn.method = "rann" in FindIntegrationAnchors is messing things up.
Any help would be appreciated; I have been struggling with this for a long time and don't know what else to do.

Read every nth batch in pyarrow.dataset.Dataset

In PyArrow you can now do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every batch? It seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after, but if you are trying to sample your data there are a few choices, though none achieves quite this effect.
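If it is acceptable to still read every batch from disk and merely skip most of them in Python, a minimal sketch with itertools would be (N = 10 here is arbitrary):
import itertools
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
# keep every 10th batch; note the skipped batches are still read from disk
for batch in itertools.islice(a.to_batches(), 0, None, 10):
    print(batch.num_rows)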
To load only a fraction of your data from disk you can use the dataset's head method.
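For example, a minimal sketch of that (the 1000 is arbitrary):
import pyarrow.dataset as ds

a = ds.dataset("blah.parquet")
# loads only (roughly) the first 1000 rows rather than the whole dataset
table = a.head(1000)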
There is a request in place for randomly sampling a dataset although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset consists only of parquet files, there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
# keep every other row-group fragment
sampled_fragments = all_fragments[::2]
# have to construct the sampled dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# iterator which will only return some of the batches of the source dataset
sampled_dataset.to_batches()
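The resulting iterator can then be consumed as usual, e.g.:
# each item is a pyarrow.RecordBatch drawn only from the sampled row groups
for batch in sampled_dataset.to_batches():
    print(batch.num_rows)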

Data frame error when converting iGraph to gexf object

I am trying to convert an igraph object to a gexf object using the rgexf package so that I can write a file usable in Gephi, which I prefer for network visualization.
My igraph object is created by reading in two CSVs: h.edges and h.nodes. There are both edge and node attributes. Once the files are read in, I create the igraph object, calculate centrality measures, and then attach the centrality measures as node attributes. The code looks like this:
iNet = graph_from_data_frame(d=h.edges, vertices = h.nodes, directed = F)
V(iNet)$degree = degree(iNet)
V(iNet)$eig = evcent(iNet)$vector
V(iNet)$betweenness = betweenness(iNet)
This appears to be working fine, since I can do all the normal igraph operations: plot, calculate centralities, identify communities, etc. My problem comes when I try to convert this to a gexf object. I run the following code:
library(rgexf)
iNet.gexf <- igraph.to.gexf(iNet)
but get the following error message:
Error in `[.data.frame`(x, r, vars, drop = drop) :
  undefined columns selected
Does anyone know what's happening? Although I know this example could be done by uploading the two CSVs straight to Gephi and running the calculations there, the end goal is to attach igraph's more robust calculations as attributes in ways that Gephi can't.

Tf-slim: ValueError: Variable vgg_19/conv1/conv1_1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

I am using tf-slim to extract features from several batches of images. The problem is that my code works for the first batch; after that I get the error in the title. My code is something like this:
for i in range(0, num_batches):
    # obtain the starting and ending image numbers for each batch
    batch_start = i * training_batch_size
    batch_end = min((i + 1) * training_batch_size, read_images_number)
    # obtain the images from the batch
    images = preprocessed_images[batch_start:batch_end]
    with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
        _, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
    init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'), slim.get_model_variables('vgg_19'))
    feature_conv_2_2 = end_points['vgg_19/pool5']
So, as you can see, in each iteration I select a batch of images and use the VGG-19 model to extract features from the pool5 layer. But after the first iteration I get the error on the line where I obtain the end points. One solution I found on the internet is to reset the graph each time, but I don't want to do that, because later in the code I have some weights that I train using these extracted features, and I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that: you are trying to build the same graph twice.
So it should be (in pseudocode)
create_graph()
load_checkpoint()
for each batch:
    process_data()
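Concretely, a minimal sketch of that structure, assuming TF1-style sessions and a placeholder whose batch dimension is left dynamic (the 224x224x3 input shape and the variable names are illustrative, not taken from your code):
import os
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import vgg

# build the graph exactly once; batches are fed through a placeholder
images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
with slim.arg_scope(vgg.vgg_arg_scope()):
    _, end_points = vgg.vgg_19(images_ph, num_classes=1000, is_training=False)
init_fn = slim.assign_from_checkpoint_fn(
    os.path.join(checkpoints_dir, 'vgg_19.ckpt'),
    slim.get_model_variables('vgg_19'))

with tf.Session() as sess:
    init_fn(sess)  # load the pretrained weights once
    for i in range(num_batches):
        batch = preprocessed_images[i * training_batch_size:(i + 1) * training_batch_size]
        # the same graph is reused; only the fed data changes per batch
        features = sess.run(end_points['vgg_19/pool5'],
                            feed_dict={images_ph: batch})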

Caffe: Print the softmax score

In the MNIST example included with the Caffe installation:
for any given test image, how do I get the softmax scores for each category and do some processing on them? Say, compute their mean and variance.
I am a newbie, so details would help me a lot. I am able to train the model and use the testing feature to get predictions, but I am not sure which files have to be edited in order to get the above results.
You can use the Python interface:
import caffe

net = caffe.Net('/path/to/deploy.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
in_ = read_data(...)  # this is up to you: read a sample and convert it to a numpy array
out_ = net.forward(data=in_)  # assuming your net expects a "data" input blob
Now you have the output of your net in the dictionary out_ (keys are the names of the output blobs). You can run it in a loop over several examples, etc.
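For instance, a minimal sketch of such a loop, where test_image_paths is a hypothetical list of samples and "prob" is assumed to be the name of the softmax output blob:
scores = []
for path in test_image_paths:  # hypothetical list of test samples
    in_ = read_data(path)
    out_ = net.forward(data=in_)
    # copy the result: Caffe reuses the blob's buffer on the next forward pass
    scores.append(out_['prob'].copy())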
I can try to answer your question. Assuming that in your deploy net the softmax layer looks like this:
layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc6"
  top: "prob"
}
In the Python code that processes your data, building on the code @Shai provided, you can get the probability of each category by adding:
predicted_prob = net.blobs['prob'].data
predicted_prob will be an array containing the probabilities of all categories.
For example, if you only have two categories, predicted_prob[0][0] will be the probability that this testing data belongs to one category and predicted_prob[0][1] the probability of the other.
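Since the original question asks for the mean and variance of the scores, a minimal sketch on top of that (assuming predicted_prob has shape (batch_size, num_classes)):
import numpy as np

probs = np.array(predicted_prob)  # shape: (batch_size, num_classes)
mean_score = probs.mean(axis=1)   # mean softmax score per test image
var_score = probs.var(axis=1)     # variance of the scores per test image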
PS:
If you don't want to write any additional Python script, then according to https://github.com/BVLC/caffe/tree/master/examples/mnist this example automatically runs testing every 500 iterations. The "500" is defined in the solver, e.g. https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
So you would need to trace through the Caffe source code that processes the solver file. I guess it should be https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp
I am not sure solver.cpp is the correct file to look at, but in it you can see functions for testing and calculating some values. I hope this gives you some ideas if no one else answers your question.