create a weighted undirected graph from data.frame in r? - igraph

I have search similar question before,but still do not solve my question.
I have data frame with three columns,First two columns are
vertex ,third column is weights.
I want to create a weighted undirected graph,I use code like this
graph.data.frame(d =aggdat1, directed = F)
but How can I add weights ?
besides ,in my data frame,there are some repeated edges ,like a-b and b-a
,I just need make "directed = F"?
thank you very much.

Related

Deep Learning CNN MINST TensorFlow applied to my own images

I'm new in Deep Learning and I started with the TenserFlow tutorials (The beginner one and the expert one).In both of them, the data is imported at the beginning with these 2 lines :
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
I would like to use this neural network on my own images. I have 100 000 images et a fileLabel.txt giving the labels for each image in order by column. Is there a way to change these two lines or a few others to import my images without breaking all the code ? I really don't see how to do that, I have the impression that the structure mnist is specific to the images of the tutorial.
Thanks in advance for your help
Short answer to your question is yes - its possible. You don't need to break any code IF your data is also similar to MNIST data with 10 labels and well organized.
Assuming that is not the case, then you need to organize your input data so that you can define (create) your model.
Organizing of your input data includes
Having consistent image sizes (for example MNIST is 28x28 pixel images)
Labeling of your images (for example MNIST has 10 labels - 0 to 9)
Finally how you intend to split your data (for example MNIST data is split into three parts: 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation).
Once you organize your input data, then you read your data by writing a small function like read_images that does something like
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
....
Then assuming you want to "label" ahead of time similar to MNIST data, you can store them in a file and read them in your program.
After that, you would have to populate the tf.train.string_input_producer() with a list of strings containing the filename and the label.
....

Plotting Multiple Regression

We have to do multiple regression for a college project. We have all the data, but we don't know, how to plot it and howenter image description here to draw a regression line.
Our current programming
If you are new to R, try the visreg pacakge. Suppose you you have 3 explanatory variables in your model:
library(visreg)
lm.result <- lm(povred~lnenp+maj+pr)
Since you have 3 variables, visreg would produce 3 plots. In each plot the other two variables are held at mean. So you first want to define a 2x2 space for plots:
par(mfrow=c(2,2))
Then just ask visreg to plot your lm result:
visreg(lm.result)
This is how the plots should look like

Spatial Join for two variable visualization

I want to know if I can use Spatial Join functions for visualize a dataset based in two variables.
My csv has 541000 rows and I'm trying to make a visualization in Zeppelin with Spark to minimize de point draws.
All examples I've seen are to GIS systems but there are not the type of data I need.
My csv is this:
id, variableX, variableY, type.
I'm trying to apply a Spatial Join logic to variableX and variableY.
Thank you.
spark-highcharts might do what you want.
It's too much to plot half million points directly. There are some aggregation or filter needed. spark-highcharts will do the aggregation automatically.
For 2 dimension data, chart type like, line, area, spline.
For 3 dimension data, chart type like, arearange, scatter can be used.
With following code to plot bank data provided in Zeppelin Tutorial. It can plot a spline chart with xAxis use column age, and yAxis using aggregated average balance
import com.knockdata.spark.highcharts._
import com.knockdata.spark.highcharts.model._
highcharts(bank.series("name" -> "age", "y" -> avg($"balance")).orderBy($"age")).
xAxis(new XAxis("age").typ("category")).
chart(Chart.spline).
plot()

What is out of bag error in Random Forests? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
What is out of bag error in Random Forests?
Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will take an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
summary of RF:
Random Forests algorithm is a classifier based on primarily two methods -
Bagging
Random subspace method.
Suppose we decide to have S number of trees in our forest then we first create S datasets of "same size as original" created from random resampling of data in T with-replacement (n times for each dataset). This will result in {T1, T2, ... TS} datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti can have duplicate data records and Ti can be missing several data records from original datasets. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
Bagging is the process of taking bootstraps & then aggregating the models learned on each bootstrap.
Now, RF creates S trees and uses m (=sqrt(M) or =floor(lnM+1)) random subfeatures out of M possible features to create any tree. This is called random subspace method.
So for each Ti bootstrap dataset you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM} you let it pass through each tree and produce S outputs (one for each tree) which can be denoted by Y = {y1, y2, ..., ys}. Final prediction is a majority vote on this set.
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi,yi) in the original training set i.e. T, select all Tk which does not include (Xi,yi). This subset, pay attention, is a set of boostrap datasets which does not contain a particular record from the original dataset. This set is called out-of-bag examples. There are n such subsets (one for each data record in original dataset T). OOB classifier is the aggregation of votes ONLY over Tk such that it does not contain (xi,yi).
Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare it with known yi's).
Why is it important?
The study of error estimates for bagged classifiers in Breiman
[1996b], gives empirical evidence to show that the out-of-bag estimate
is as accurate as using a test set of the same size as the training
set. Therefore, using the out-of-bag error estimate removes the need
for a set aside test set.1
(Thanks #Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data. As the forest is built, each tree can thus be tested (similar to leave one out cross validation) on the samples not used in building that tree. This is the out of bag error estimate - an internal error estimate of a random forest as it is being constructed.

Extract 3D coordinates from R PCA

I am trying to find a way make 3D PCA visualization from R more portable;
I have run a PCA on 2D matrix using prcomp().
How do I export the 3D coordinates of data points, along with labels and colors (RGB) associated with each?
Whats the practical difference with princomp() and prcomp()?
Any ideas on how to best view the 3D PCA plot using HTML5 and canvas?
Thanks!
Here is an example to work from:
pc <- prcomp(~ . - Species, data = iris, scale = TRUE)
The axis scores are extracted from component x; as such you can just write out (you don't say how you want the exported) as CSV using:
write.csv(pc$x[, 1:3], "my_pc_scores.csv")
If you want to assign information to these scores (the colours and labels, which are not something associated with the PCA but something you assign yourself), then add them to the matrix of scores and then export. In the example above there are three species with 50 observations each. If we want that information exported alongside the scores then something like this will work
scrs <- data.frame(pc$x[, 1:3], Species = iris$Species,
Colour = rep(c("red","green","black"), each = 50))
write.csv(scrs, "my_pc_scores2.csv")
scrs looks like this:
> head(scrs)
PC1 PC2 PC3 Species Colour
1 -2.257141 -0.4784238 0.12727962 setosa red
2 -2.074013 0.6718827 0.23382552 setosa red
3 -2.356335 0.3407664 -0.04405390 setosa red
4 -2.291707 0.5953999 -0.09098530 setosa red
5 -2.381863 -0.6446757 -0.01568565 setosa red
6 -2.068701 -1.4842053 -0.02687825 setosa red
Update missed the point about RGB. See ?rgb for ways of specifying this in R, but if all you want are the RGB strings then change the above to use something like
Colour = rep(c("#FF0000","#00FF00","#000000"), each = 50)
instead, where you specify the RGB strings you want.
The essential difference between princomp() and prcomp() is the algorithm used to calculate the PCA. princomp() uses a Eigen decomposition of the covariance or correlation matrix whilst prcomp() uses the singular value decomposition (SVD) of the raw data matrix. princomp() only handles data sets where there are at least as many samples (rows) and variables (columns) in your data. prcomp() can handle that type of data and data sets where there are more columns than rows. In addition, and perhaps of greater importance depending on what uses you had in mind, the SVD is preferred over the eigen decomposition for it's better numerical accuracy.
I have tagged the Q with html5 and canvas in the hope specialists in those can help. If you don't get any responses, delete point 3 from your Q and start a new one specifically on the topic of displaying the PCs using canvas, referencing this one for detail.
You can find out about any R object by doing str(object_name). In this case:
m <- matrix(rnorm(50), nrow = 10)
res <- prcomp(m)
str(m)
If you look at the help page for prcomp by doing ?prcomp, you can discover that the scores are in res$x and the loadings are in res$rotation. These are labeled by PC already. There are no colors, unless you decide to assign some colors in the course of a plot. See the respective help pages to compare princomp with prcomp for a comparison between the two functions. Basically, the difference between them has to do with the method used behind the scenes. I can't help you with your last question.
You state that your perform PCA on a 2D matrix. If this is your data matrix there is no way to get 3D PCA's. Ofcourse it might be that your 2D matrix is a covariance matrix of the data, in that case you need to use princomp (not prcomp!) and explictely pass the covariance matrix m like this:
princomp(covmat = m)
Passing the covariance matrix like:
princomp(m)
does not yield the correct result.