Self-Initializing Cellular Automata

What is the largest cellular automaton one can generate given the following conditions:
- an n x m rectangular grid
- basic Conway rules
- the grid is not continuous (meaning edge behavior does not wrap around)
- it perfectly (or near perfectly) reproduces its initialization pattern (and is therefore recursive).
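To make the conditions concrete, here is a minimal sketch (my own illustration, not part of the question) of one Life step on a bounded, non-wrapping grid together with a check for whether a pattern reproduces its initialization; the grid size and the blinker pattern are arbitrary choices.

import numpy as np

def life_step(grid):
    # One Game-of-Life step on a bounded n x m grid; cells outside the
    # grid are treated as permanently dead (no wrap-around).
    padded = np.pad(grid, 1, mode="constant", constant_values=0)
    neighbours = sum(
        padded[1 + dy : grid.shape[0] + 1 + dy, 1 + dx : grid.shape[1] + 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    # Conway's rules: a live cell survives with 2 or 3 neighbours,
    # a dead cell is born with exactly 3.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(grid.dtype)

def recurrence_period(grid, max_steps=1000):
    # Number of steps after which the initialization pattern recurs, if any.
    state = grid.copy()
    for t in range(1, max_steps + 1):
        state = life_step(state)
        if np.array_equal(state, grid):
            return t
    return None

# A blinker (period-2 oscillator) on a 5x5 bounded grid reproduces itself.
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1
print(recurrence_period(blinker))  # 2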

Related

Cost of back-propagation for subset of DNN parameters

I am using PyTorch to evaluate gradients of a feed-forward network, but only for a subset of parameters, namely those of the first two layers.
Since backpropagation is carried out backwards layer by layer, I wonder: why is this computationally faster than evaluating the gradients of the whole network?
PyTorch builds a computation graph for backward propagation that contains only the minimum nodes and edges needed to get the accumulated gradient for the leaves that require gradient. Even if only the first two layers require gradient, many tensors (intermediate tensors or frozen parameter tensors) are unused and are cut from the backward graph. In addition, the built-in AccumulateGrad function that stores the gradients in the .grad attribute is called fewer times, which also reduces the total computation time.
Here you can see an example for an AddBackward node where, for instance, A is an intermediate tensor computed from the first two layers and B is the 3rd (constant) layer, which can be ignored.
Another example: a matrix-matrix product (MmBackward node) that uses an intermediate tensor which does not depend on the first two layers. In this case the tensor itself is required to compute the backward pass, but the "previous" tensors that were used to compute it can be ignored in the graph.
To visualize the sub-graph that is actually computed (and compare it with the graph when the model is unfrozen), you can use torchviz.
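A rough illustration of this (the network and layer sizes below are made-up assumptions): freeze everything except the first two Linear layers, run backward(), and observe that only the unfrozen parameters receive a .grad, since the frozen parameters' branches and their AccumulateGrad nodes are pruned from the backward graph.

import torch
import torch.nn as nn

# Hypothetical 4-layer feed-forward net; only the first two Linear layers are trained.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Freeze all parameters except those of the first two Linear layers (indices 0 and 2).
for i, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad_(i in (0, 2))

x = torch.randn(8, 64)
loss = model(x).sum()
loss.backward()

# Only the unfrozen parameters got gradients; the frozen ones stay None because
# autograd never built AccumulateGrad nodes (or weight-gradient branches) for them.
for name, p in model.named_parameters():
    print(name, p.grad is None)

# torchviz.make_dot(loss, params=dict(model.named_parameters())) would show the pruned graph.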

Why is a 1x1 conv the same as a fully connected layer?

I am currently reading the "Network in Network" paper.
In the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with 1x1 convolution kernel."
My question is, first of all, what exactly does "cross channel parametric pooling layer" mean? Is it just a fully connected layer?
And why is the cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be thankful if you could answer both mathematically and with examples.
Please help me~
I haven't read the paper but I have a fair idea of what this is. First of all
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = number of channels, H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (the number of output channels of the convolution) kernels, each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution on a single pixel of the input feature map.
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, to whole feature maps rather than individual pixels. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean, sum, or max; it's actually a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time:
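A quick numerical check of this equivalence (the channel counts and spatial size below are arbitrary): a 1x1 convolution and a fully connected layer sharing the same weight matrix give identical outputs when the linear layer is applied to every pixel independently.

import torch
import torch.nn as nn

C_in, C_out, H, W = 8, 4, 5, 5                        # arbitrary sizes for the sketch
x = torch.randn(1, C_in, H, W)

conv1x1 = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)

# A fully connected layer sharing the conv's weights: the (C_out, C_in, 1, 1)
# kernel tensor is just a (C_out, C_in) weight matrix.
fc = nn.Linear(C_in, C_out, bias=False)
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(C_out, C_in))

out_conv = conv1x1(x)                                 # (1, C_out, H, W)

# Apply the FC layer to every pixel independently: channels last,
# flatten the spatial positions, then undo the reshape.
pixels = x.permute(0, 2, 3, 1).reshape(-1, C_in)      # (H*W, C_in)
out_fc = fc(pixels).reshape(1, H, W, C_out).permute(0, 3, 1, 2)

print(torch.allclose(out_conv, out_fc, atol=1e-6))    # True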

Why is the W_q matrix in torch.nn.MultiheadAttention square?

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of the features of each query, as the matrix is square. Is this a bug in the implementation, or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are used independently to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in the queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix, and the projected queries are passed to the F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
The attention head sizes are a design decision of PyTorch. In theory, you could have a different head size, so that the projection matrix would have a shape of embed_dim × (num_heads * head_dim). Some implementations of Transformers (such as the C++-based Marian for machine translation, or Huggingface's Transformers) allow that.
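As a rough sketch of the head splitting described above (the dimensions are made up for illustration), the projected query of shape (seq_len, batch, embed_dim) is reshaped so that each head sees its own embed_dim / num_heads slice:

import torch

embed_dim, num_heads = 8, 2
head_dim = embed_dim // num_heads          # embed_dim must be divisible by num_heads
seq_len, batch = 5, 1

# Pretend q has already been projected with the (embed_dim x embed_dim) W_q.
q = torch.randn(seq_len, batch, embed_dim)

# Roughly what F.multi_head_attention_forward does: split the last dimension
# into heads so each head attends over its own head_dim-sized slice.
q_heads = (
    q.contiguous()
     .view(seq_len, batch * num_heads, head_dim)
     .transpose(0, 1)                      # (batch * num_heads, seq_len, head_dim)
)
print(q_heads.shape)                       # torch.Size([2, 5, 4])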

What is the purpose of the ROI layer in a Fast R-CNN?

In this tutorial about object detection, the fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to final convolution layer activation functions (in each cell)?
Region-of-Interest(RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of doing such a pooling is to speed up the training and test time and also to train the whole system from end-to-end (in a joint manner).
It is because of this pooling layer that the training and test time is faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example (from Region of interest pooling explained by deepsense.io):
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of fully connected layers.
How the ROI layer works is shown below:
In this image, an input of arbitrary size is fed into this layer, which has 3 different windows: 4x4 (blue), 2x2 (green), and 1x1 (gray), producing outputs with fixed sizes of 16 x F, 4 x F, and 1 x F respectively, where F is the number of filters. Those outputs are then concatenated into a single vector and fed to the fully connected layer.
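To make the fixed-size output concrete, here is a minimal, simplified RoI pooling sketch based on adaptive max pooling (the feature-map size, box coordinates, and output size are arbitrary assumptions; in practice torchvision.ops.roi_pool provides this operation):

import torch
import torch.nn.functional as F

# Hypothetical conv feature map: 1 image, F = 256 filters, 32x32 spatial size.
feature_map = torch.randn(1, 256, 32, 32)

def roi_pool(features, box, output_size=(7, 7)):
    # Max-pool an arbitrary-size region (x1, y1, x2, y2, in feature-map
    # coordinates) down to a fixed output_size, as the RoI layer does.
    x1, y1, x2, y2 = box
    region = features[:, :, y1:y2, x1:x2]            # arbitrary H x W crop
    return F.adaptive_max_pool2d(region, output_size)

pooled = roi_pool(feature_map, box=(4, 6, 20, 30))   # any region size ...
print(pooled.shape)                                  # ... becomes torch.Size([1, 256, 7, 7])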

Rules to set hyper-parameters alpha and theta in LDA model

I would like to know whether or not there are any rules for setting the hyper-parameters alpha and theta in the LDA model. I run an LDA model using the gensim library:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word = dictionary, passes=50, minimum_probability=0)
But I have my doubts about the specification of the hyper-parameters. From what I read in the library documentation, both hyper-parameters are set to 1/number of topics. Given that my model has 30 topics, both hyper-parameters are set to the common value 1/30. I am running the model on news articles that describe economic activity. For this reason, I expect the document-topic distribution (theta) to be dense (documents containing similar topics), and the topic-word distribution (alpha) to be dense as well (topics sharing many words in common, or words not being so exclusive to each topic). Given that my understanding of the hyper-parameters is correct, is 1/30 a correct specification value?
I'll assume you expect theta and phi (the document-topic proportions and the topic-word proportions) to be closer to equiprobable distributions than to sparse ones with exclusive topics/words.
Since alpha and beta are the parameters of symmetric Dirichlet priors, they have a direct influence on exactly that. A Dirichlet distribution outputs probability distributions. When the parameter is 1, all possible distributions are equally likely outcomes (for K=2, [0.5, 0.5] and [0.99, 0.01] have the same chance). When the parameter is greater than 1, it behaves as a pseudo-count, i.e. a prior belief: for high values, near-equiprobable outputs are preferred (P([0.5, 0.5]) > P([0.99, 0.01])). A parameter below 1 has the opposite behaviour. For big vocabularies you don't expect topics to put probability on every word; that's why beta tends to be set below 1 (and the same goes for alpha).
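A quick way to see this behaviour (the topic count and parameter values below are arbitrary) is to draw from symmetric Dirichlet distributions with different concentration parameters:

import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics

for alpha in (0.1, 1.0, 10.0):
    sample = rng.dirichlet([alpha] * K)   # one draw from a symmetric Dirichlet
    print(alpha, np.round(sample, 2))

# alpha < 1  -> sparse draws (most mass on a few topics)
# alpha = 1  -> every distribution on the simplex is equally likely
# alpha > 1  -> draws concentrate near the equiprobable [0.2, 0.2, 0.2, 0.2, 0.2]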
However, since you're using gensim, you can let the model learn the alpha and beta values for you, allowing it to learn asymmetric vectors (see here), where the documentation states:
alpha can be set to an explicit array = prior of your choice. It also
support special values of ‘asymmetric’ and ‘auto’: the former uses a
fixed normalized asymmetric 1.0/topicno prior, the latter learns an
asymmetric prior directly from your data.
The same goes for eta (which I call beta).
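For example, assuming the corpus and dictionary objects from the question, the model call could be changed along these lines (a sketch of the option, not a recommendation of specific values):

import gensim

# Same model as in the question, but letting gensim learn asymmetric priors
# from the data instead of fixing both to the symmetric 1/num_topics default.
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus,                  # assumed to exist, as in the question
    num_topics=30,
    id2word=dictionary,      # assumed to exist, as in the question
    passes=50,
    minimum_probability=0,
    alpha='auto',            # learns an asymmetric document-topic prior
    eta='auto',              # learns an asymmetric topic-word prior (the "beta" above)
)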