Parsing JSON from a URL in R with different numbers of fields

I'm having a lot of trouble trying to read some JSON data obtained from a URL in R. I'm able to read in the data and call on each observation to get the values (as characters, which is fine), but I can't seem to find a way to get the data into a table format (basically like in Excel).
I've tried to create a loop which calls on each field to place it in an empty matrix; however, not every object has the same number of fields (i.e. some values have Label1 and Label2, while others have just Label1), and I get the error that the subscripts are out of bounds. What I was thinking was to make a conditional statement whereby if the field exists then its value is put in the data matrix, and if it does not exist then I insert an NA. But I get the subscript error before I can do the conditional evaluation. I've looked to see if I can coerce an error to become an NA, but I don't think this is possible.
I'm starting the index from j = 3, since the first two observations in the JSON code are not needed. My problem is that, for example, json$poi[[j]]$label[[2]]$value may not exist for every observation, and I automatically get an error when the code comes across the first observation missing this field.
The data is quite big: around 4,480 observations with up to 20 fields each, though I only require the nine fields listed below. Here is a link to the data URL (it may take some time to load):
http://tourism.citysdk.cm-lisboa.pt/pois/?limit=-1
I'm quite new to coding, and especially to dealing with JSON files, so my apologies if this has a simple solution that I'm not seeing. Thanks!
library(rjson)
library(RCurl)
json <- fromJSON(getURL('http://tourism.citysdk.cm-lisboa.pt/pois/?limit=-1'))
ljson <- length(json$poi)-2
data <- matrix(data=NA, nrow=ljson, ncol=9)
for(i in 1:ljson)
{
  j <- i+2
  d1 <- json$poi[[j]]$location$point[[1]]$Point$posList
  d2 <- json$poi[[j]]$label[[1]]$value
  d3 <- json$poi[[j]]$label[[2]]$value
  d4 <- json$poi[[j]]$category[[1]]$value
  d5 <- json$poi[[j]]$category[[2]]$value
  d6 <- json$poi[[j]]$id
  d7 <- json$poi[[j]]$author$value
  d8 <- json$poi[[j]]$license$value
  d9 <- json$poi[[j]]$description[[1]]$value
  if(exists("d1") == TRUE){
    d1 <- json$poi[[j]]$location$point[[1]]$Point$posList
  } else {
    d1 <- NA
  }
  if(exists("d2") == TRUE){
    d2 <- json$poi[[j]]$label[[1]]$value
  } else {
    d2 <- NA
  }
  if(exists("d3") == TRUE){
    d3 <- json$poi[[j]]$label[[2]]$value
  } else {
    d3 <- NA
  }
  if(exists("d4") == TRUE){
    d4 <- json$poi[[j]]$category[[1]]$value
  } else {
    d4 <- NA
  }
  if(exists("d5") == TRUE){
    d5 <- json$poi[[j]]$category[[2]]$value
  } else {
    d5 <- NA
  }
  if(exists("d6") == TRUE){
    d6 <- json$poi[[j]]$id
  } else {
    d6 <- NA
  }
  if(exists("d7") == TRUE){
    d7 <- json$poi[[j]]$author$value
  } else {
    d7 <- NA
  }
  if(exists("d8") == TRUE){
    d8 <- json$poi[[j]]$license$value
  } else {
    d8 <- NA
  }
  if(exists("d9") == TRUE){
    d9 <- json$poi[[j]]$description[[1]]$value
  } else {
    d9 <- NA
  }
  data[i,] <- rbind(c(d1,d2,d3,d4,d5,d6,d7,d8,d9))
}

For JSON & XML list structures, str is your friend! You can use it to inspect all or portions of a list structure. Using sapply on individual components for extraction is probably better than the for construct, and you'll need to handle NULLs and missing sub-structure components to build a data frame from this JSON (and from many JSON files, actually). The following gets you started, but you still have some work to do:
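For example, a quick way to eyeball a single POI before writing any extraction code (a small sketch; max.level just limits how deeply str prints the nesting):
# inspect the structure of the 3rd POI, two levels deep
str(json$poi[[3]], max.level = 2)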
# simplify extraction (saves typing, too)
poi <- json$poi
# start at 3rd element
poi <- poi[3:length(poi)]
# have to do some special checking since the value isn't always there
poi_points <- sapply(poi, function(x) {
  if ("point" %in% names(x$location) && length(x$location$point) > 0) {
    x$location$point[[1]]$Point$posList
  } else {
    NA
  }
})
# replace NULLs with NA (the data.frame call won't like NULLs later)
poi_description <- sapply(poi, function(x) {
  if (is.null(x$description[[1]]$value)) {
    NA
  } else {
    x$description[[1]]$value
  }
})
# replace NULLs with NA (the data.frame call won't like NULLs later)
poi_category <- sapply(poi, function(x) {
  if (is.null(x$category[[1]]$value)) {
    NA
  } else {
    x$category[[1]]$value
  }
})
# simpler extractions
poi_label <- sapply(poi, function(x) x$label[[1]]$value)
poi_id <- sapply(poi, function(x) x$id)
poi_author <- sapply(poi, function(x) x$author$value)
poi_license <- sapply(poi, function(x) x$license$value)
# make a data frame
poi <- data.frame(poi_label, poi_category, poi_id, poi_points, poi_author, poi_license, poi_description)
str(poi)
## 'data.frame': 4482 obs. of 7 variables:
## $ poi_label : Factor w/ 4482 levels "\"Bloco das Águas Livres\", edifício de habitação, comércio e serviços",..: 363 765 764 1068 174 419 461 762 420 412 ...
## $ poi_category : Factor w/ 129 levels "Acessórios de Uso Pessoal",..: 33 33 33 33 33 33 123 33 33 33 ...
## $ poi_id : Factor w/ 4482 levels "52d7bf4d723e8e0b0cc08b69",..: 2 3 4 5 7 8 15 16 17 18 ...
## $ poi_points : Factor w/ 3634 levels "38.405892 -9.93503",..: 975 244 478 416 301 541 2936 2975 2850 2830 ...
## $ poi_author : Factor w/ 1 level "CitySDK": 1 1 1 1 1 1 1 1 1 1 ...
## $ poi_license : Factor w/ 1 level "open-data": 1 1 1 1 1 1 1 1 1 1 ...
## $ poi_description: Factor w/ 2831 levels "","\n","\n\n",..: 96 1051 NA NA 777 1902 NA 1038 81 82 ...
##
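Part of the remaining work is the second label and category values, which are the fields most often missing. Here is a sketch of the same guard pattern; the tryCatch variant also shows that an error can in fact be coerced to NA, which the question wondered about:
# second label, guarded by a length check (NA when absent)
poi_label2 <- sapply(poi, function(x) {
  if (length(x$label) >= 2) x$label[[2]]$value else NA
})
# equivalently, coerce the subscript-out-of-bounds error itself to NA
poi_category2 <- sapply(poi, function(x) {
  tryCatch(x$category[[2]]$value, error = function(e) NA)
})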

Related

Extracting JSON-data from CSV file

I'm trying to extract JSON data that is stored as a column in a CSV file. So far I've come to the point where I've extracted the column in the right format, but the formatting is only correct when the variable type is factor, and I can't parse a factor as JSON using the jsonlite package.
[1] {"id":509746197991998767,"visibility":{"percentage":100,"time":149797,"visible1":true,"visible2":false,"visible3":false,"activetab":true},"interaction":{"mouseovercount":1,"mouseovertime":1426,"videoplaytime":0,"engagementtime":0,"expandtime":0,"exposuretime":35192}}
Another approach is to use stringsAsFactors = FALSE when importing, but I'm struggling to get the formatting right, where each entry looks like this:
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Am I missing something obvious here? I simply want to extract the JSON that sits inside a CSV file.
Here's a small example of the CSV file:
"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"
Regards,
Frederik.
df <- readr::read_csv('"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"',
col_types = "-cc")
Use jsonlite::fromJSON on each separate value, then tidyr::unnest:
library(dplyr)
f <- function(.x) {
  if (is.na(.x) || .x == "") data.frame()[1, ]
  else as.data.frame(jsonlite::fromJSON(.x))
}
df %>%
  tidyr::unnest(UnloadVars = lapply(UnloadVars, f)) %>%
  mutate_at(vars(ends_with("id")), as.character)
# A tibble: 6 x 16
# CookieID id visibility.percentage visibility.time visibility.visible1 visibility.visible2 visibility.visible3 visibility.activetab interaction.mouseovercount interaction.mouseovertime interaction.videoplaytime interaction.engagementtime interaction.expandtime interaction.exposuretime clicks.x clicks.y
# <chr> <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 -8857188784608690176 509746197991998784 100 149797 TRUE FALSE FALSE TRUE 1 1426 0 0 0 35192 NA NA
# 2 -1695626857458244096 2917654329769114112 46 0 FALSE FALSE FALSE TRUE NA NA NA NA NA NA NA NA
# 3 437299165071669184 2252707957388071936 99 10168 TRUE FALSE FALSE TRUE 0 0 0 0 0 542 105 449
# 4 292660729552227520 <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
# 6 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
I used readr::read_csv to read in your sample data set.
> df <- readr::read_csv('~/sample.csv')
Parsed with column specification:
cols(
CookieID = col_double(),
UnloadVars = col_character()
)
As you can see, UnloadVars is read in as character and not factor. If I now examine the first value in the UnloadVars column, I see the following, which matches what you get:
> df$UnloadVars[1]
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Now, I use jsonlite::fromJSON,
> j <- jsonlite::fromJSON(df$UnloadVars[1])
> j
$id
[1] 5.097462e+17
$visibility
$visibility$percentage
[1] 100
$visibility$time
[1] 149797
$visibility$visible1
[1] TRUE
$visibility$visible2
[1] FALSE
$visibility$visible3
[1] FALSE
$visibility$activetab
[1] TRUE
$interaction
$interaction$mouseovercount
[1] 1
$interaction$mouseovertime
[1] 1426
$interaction$videoplaytime
[1] 0
$interaction$engagementtime
[1] 0
$interaction$expandtime
[1] 0
$interaction$exposuretime
[1] 35192
This, I believe, is what you need, since JSON is parsed into lists in R.
It can be very tricky to deal with JSON data. As a general guideline, you should always strive to have your data in a data frame; this, however, is not always possible. In this specific case, I don't see a way to have both visibility and interaction values at once in a nicely formatted data frame.
What I will do next is extract the information from interaction into a data frame.
Load required packages and read the data
library(purrr)
library(dplyr)
library(tidyr)
df <- read.csv("sample.csv", stringsAsFactors = FALSE)
Then remove invalid JSON:
# remove rows without JSON (in this case, the 4th row)
df <- df %>%
  dplyr::filter(UnloadVars != "")
Transform each JSON string into a list and put it in the UnloadVars column. In case you didn't know, it is possible to have a list column in a data frame; this can be very useful.
out <- data_frame(CookieID = numeric(),
                  UnloadVars = list())
for (row in 1:nrow(df)) {
  new_row <- data_frame(CookieID = df[row, ]$CookieID,
                        UnloadVars = list(jsonlite::fromJSON(df[row, ]$UnloadVars)))
  out <- bind_rows(out, new_row)
}
out
We can now extract the IDs from the lists in UnloadVars. This is straightforward because there is only one ID per list.
out <- out %>%
  mutate(id = map_chr(UnloadVars, ~ .$id))
This final part can seem a bit intimidating, but all I am doing is taking the interaction part of the UnloadVars column and putting it into an interaction column. I then transform each row of interaction, which is a list, into a data frame with two columns: key, which contains the name of the interaction metric, and value, which contains its value. I finally unnest it, so we get rid of the list columns and end up with a nicely formatted data frame.
unpack_list <- function(obj) {
  as.data.frame(obj) %>%
    gather(key, value)
}
df_interaction <- out %>%
  mutate(interaction = map(UnloadVars, ~ .$interaction)) %>%
  mutate(interaction = map(interaction, unpack_list)) %>%
  unnest(interaction)
df_interaction
The solution is not very elegant, but gets the job done. You could apply the same logic to extract information from visibility.
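For visibility, a sketch reusing unpack_list from above (same assumptions about the sample data's column names):
df_visibility <- out %>%
  mutate(visibility = map(UnloadVars, ~ .$visibility)) %>%
  mutate(visibility = map(visibility, unpack_list)) %>%
  unnest(visibility)
df_visibility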

Convert JSON into CSV in R programming

I have JSON of the form:
{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}
I need to convert JSON in the above format to a CSV file in R.
CSV format:
ds y
123 45600
378 78689
343 23456
I'm using the R library rjson to do so. I'm doing something like this:
jsonFile <- fromJSON(file=fileName)
json_data_frame <- as.data.frame(jsonFile)
but it's not producing what I need.
You can use jsonlite::fromJSON to read the data into a list, though you'll need to pull it apart to assemble it into a data.frame:
abc <- jsonlite::fromJSON('{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}')
abc <- data.frame(ds = names(abc[[1]]),
                  y = unlist(abc[[1]]), stringsAsFactors = FALSE)
abc
#> ds y
#> 123 123 45600
#> 378 378 78689
#> 343 343 23456
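To finish the conversion the question asks for, the data frame can then be written straight out (the file name here is arbitrary):
# write the result to disk as CSV
write.csv(abc, "abc.csv", row.names = FALSE)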
I believe you got the JSON file reader, the fromJSON function, right.
df <- data.frame( do.call(rbind, rjson::fromJSON( '{"a":true, "b":false, "c":null}' )) )
The code below gets my Google Location History (JSON) archive from https://takeout.google.com, which is available if you have enabled a 'Timeline' (location tracking) in Google Maps on your cell. Credit to http://rpubs.com/jsmanij/131030 for the original code. Note that JSON files like this can be quite large, and plyr::llply is much more efficient than lapply in parsing a list. data.table gives me the more efficient rbindlist to take the list to a data.table. Google logs between 350 and 800 GPS calls each day for me, so a multi-year location history is converted to quite a sizeable list by fromJSON:
format(object.size(doc1),units="MB")
[1] "962.5 Mb"
I found do.call(rbind, ...) unoptimized. The timestamp, lat, and long needed some work to be useful to Google Earth Pro, but I am getting carried away. At the end, I use write.csv to take the data.table to CSV, which is all the original OP wanted here.
ts lat long latitude longitude
1: 1416680531900 487716717 -1224893214 48.77167 -122.4893
2: 1416680591911 487716757 -1224892938 48.77168 -122.4893
3: 1416680668812 487716933 -1224893231 48.77169 -122.4893
4: 1416680728947 487716468 -1224893275 48.77165 -122.4893
5: 1416680791884 487716554 -1224893232 48.77166 -122.4893
library(data.table)
library(rjson)
library(plyr)
doc1 <- fromJSON(file="LocationHistory.json", method="C")
object.size(doc1)
timestamp <- function(x) {as.list(x$timestampMs)}
timestamps <- as.list(plyr::llply(doc1$locations,timestamp))
timestamps <- rbindlist(timestamps)
latitude <- function(x) {as.list(x$latitudeE7)}
latitudes <- as.list(plyr::llply(doc1$locations,latitude))
latitudes <- rbindlist(latitudes)
longitude <- function(x) {as.list(x$longitudeE7)}
longitudes <- as.list(plyr::llply(doc1$locations,longitude))
longitudes <- rbindlist(longitudes)
datageoms <- setnames(cbind(timestamps,latitudes,longitudes),c("ts","lat","long")) [order(ts)]
write.csv(datageoms,"datageoms.csv",row.names=FALSE)
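For reference, a sketch of the conversions behind the latitude, longitude, and readable-time columns implied by the sample output above (Google stores coordinates scaled by 1e7 and timestamps in milliseconds):
datageoms[, latitude := as.numeric(lat) / 1e7]
datageoms[, longitude := as.numeric(long) / 1e7]
datageoms[, time := as.POSIXct(as.numeric(ts) / 1000, origin = "1970-01-01")]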

Flatten deep nested json in R

I am trying to use R to convert a nested JSON file into a two dimensional dataframe.
My JSON file has a nested structure. But, the names and properties are the same across levels.
{"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}
The desired dataset would look like this, although the exact column names can be different.
name value c_name c_value c_c_name c_c_value
A 1 a1 11 a11 111
A 1 a1 11 a12 112
A 1 a2 12
The code I have so far flattens the data, but it only seems to work for the first level (see the screenshot of the output).
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
data <- fromJSON(json_file, flatten = TRUE)
View(data)
I tried multiple packages, including jsonlite and RJSONIO, and I spent the last five hours debugging this and trying various online tutorials, but without success. Thanks for your help!
Firstly, that is some ugly JSON; if you have a way of avoiding it, do so. Consequently, what follows is also pretty ugly, to the degree that I normally wouldn't post it, but I am doing so now in the hope that some of the approaches may be of use. If it offends your eyes, let me know and I'll delete it.
library(jsonlite) # for fromJSON
library(reshape2) # for melt
library(dplyr) # for inner_join, select
jlist <- fromJSON(json_file)
jdf <- as.data.frame(jlist)
jdf$c.value <- as.numeric(jdf$c.value) # fix type
jdf$L1 <- as.integer(factor(jdf$c.name)) # for use as a key with an artifact of melt later *urg, sorry*
ccdf <- melt(jdf$c.c) # get nested list into usable form
names(ccdf)[1:2] <- c('c.c.name', 'c.c.value') # fix names so they won't cause problems with the join
df3 <- inner_join(jdf[, -5], ccdf) # join, take out nested column
df3$c.c.value <- as.numeric(df3$c.c.value) # fix type
df3 <- df3 %>% select(-L1, -c) # get rid of useless columns
which leaves you with
> df3
name value c.name c.value c.c.name c.c.value
1 A 1 a1 11 a11 111
2 A 1 a1 11 a12 112
3 A 1 a2 12 <NA> NA
with reasonably sensible types. The packages used are avoidable, if you like.
Is this scalable? Well, not really, without more of the same mess. If anybody else has a less nasty and more scalable approach for dealing with nasty JSON, please post it; I'd be as grateful as the OP.
I think I figured out a way to do this, and it seems to work with larger trees. The idea is to unlist the JSON and use the names attribute of the unlisted elements: if a node has one parent, its name attribute will start with "c.", if it has a parent and a "grandparent", it will be listed as "c.c.", and so on. The code below uses this structure to find the level of nesting and place the node in the appropriate columns. The rest of the code adds the attributes of the parent nodes and deletes the extra rows generated. I know it is not elegant, but I thought it might be useful for others.
library(stringr)
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
nestedjson <- fromJSON(json_file, simplifyVector = F) #read the json
nAttrPerNode <- 2 #number of attributes per node
strChild <- "c." #determines level of nesting
unnestedjson <- unlist(nestedjson) #convert JSON to unlist
unnestednames <- attr(unnestedjson, "names") #get the names of the cells
depthTree <- (max(str_count(unnestednames, strChild)) + 1) * nAttrPerNode #maximum tree depth
htTree <- length(unnestednames) / nAttrPerNode #maximum tree height (number of branches)
X <- array("", c(htTree, depthTree))
for (nodeht in 1:htTree){ #iterate through the branches and place the nodes based on the count of strChild in the name attribute
  nodeIndex <- nodeht * nAttrPerNode
  nodedepth <- str_count(unnestednames[nodeIndex], strChild) + 1
  X[nodeht, nodedepth * nAttrPerNode - 1] <- unnestedjson[nodeIndex - 1]
  X[nodeht, nodedepth * nAttrPerNode] <- unnestedjson[nodeIndex]
}
for (nodeht in 2:htTree){ #repeat the parent node attributes for the children
  nodedepth <- 0
  repeat{
    nodedepth <- nodedepth + 1
    startcol <- nodedepth * nAttrPerNode - 1
    endcol <- startcol + nAttrPerNode - 1
    if (X[nodeht, startcol] == "" && nodedepth < depthTree/nAttrPerNode){
      X[nodeht, startcol:endcol] <- X[nodeht-1, startcol:endcol]
    } else {
      break
    }
  }
}
deleteRows <- NULL #finally, delete the rows that only have the parent attributes for nodes that have children
strBranches <- apply(X, 1, paste, collapse="")
for (nodeht in 1:(htTree-1)){
  branch2sub <- substr(strBranches[nodeht+1], 1, nchar(strBranches[nodeht]))
  if (strBranches[nodeht]==branch2sub){
    deleteRows <- c(deleteRows, nodeht)
  }
}
deleteRows
if (!is.null(deleteRows)) X <- X[-deleteRows,] #guard against the case where nothing needs deleting
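To get from the matrix to the data frame shown at the top of the question, a final sketch (the column names are my choice, matching this example's two levels of nesting):
flatdf <- as.data.frame(X, stringsAsFactors = FALSE)
names(flatdf) <- c("name", "value", "c_name", "c_value", "c_c_name", "c_c_value")
flatdf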

Community detection with bipartite graph in igraph

I have a bipartite list (posts, word categories) with 1000 vertices and want to use the fast-greedy algorithm for community detection, but I am not sure whether I have to run it on the bipartite graph or on the bipartite projection.
My bipartite list looks like this:
post word
1 66 2
2 312 1
3 432 7
4 433 7
5 434 1
6 435 5
7 436 1
8 437 4
When I run it without a projection I have problems clustering in the second step:
### Load bipartite list and create graph ###
bipartite_list <- read.csv("bipartite_list_tnf.csv", header = TRUE, sep = ";")
bipartite_graph <- graph.incidence(bipartite_list)
g<-bipartite_graph
fc <- fastgreedy.community(g) ## communities / clusters
set.seed(123)
l <- layout.fruchterman.reingold(g, niter=1000, coolexp=0.5) ## layout
membership(fc)
# 2. checking who is in each cluster
cl <- data.frame(name = fc$post, cluster = fc$membership, stringsAsFactors=F)
cl <- cl[order(cl$cluster),]
cl[cl$cluster==1,]
# 3. preparing data for plot
d <- data.frame(l); names(d) <- c("x", "y")
d$cluster <- factor(fc$membership)
# 4. plot with only nodes, colored by cluster
p <- ggplot(d, aes(x=x, y=y, color=cluster))
pq <- p + geom_point()
pq
Maybe I have to run the community detection on a projection? But then I always get a failure, because the projection is not a graph object:
bipartite_graph <- graph.incidence(bipartite_list)
#projection (both directions)
projection_word_post <- bipartite.projection(bipartite_graph)
fc <- fastgreedy.community(projection_word_post)
Error in fastgreedy.community(projection_word_post) : Not a graph object
I would be glad for help!
When you run without the projection the issue is at:
bipartite_graph <- graph.incidence(bipartite_list)
You need to reshape bipartite_list into an incidence table before passing it to the graph.incidence() function. Use the command below:
tab <- table(bipartite_list)
and the rest of the steps are the same:
g <- graph.incidence(tab,mode=c("all"))
fc <- fastgreedy.community(g)
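If you do want to cluster a projection instead, note that bipartite.projection() returns a list of two graphs rather than a single graph object, which is what the "Not a graph object" error above is telling you. A sketch, assuming the default proj1/proj2 element names:
proj <- bipartite.projection(g)
fc_posts <- fastgreedy.community(proj$proj1) # communities in one mode
fc_words <- fastgreedy.community(proj$proj2) # communities in the other mode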

Scraping html tables into R data frames using the XML package

How do I scrape HTML tables using the XML package?
Take, for example, this Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?
…or a shorter try:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
The table picked here is the longest one on the page:
tables[[which.max(n.rows)]]
library(RCurl)
library(XML)
# Download page using RCurl
# You may need to set proxy details, etc., in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})
# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]
#In this case, the required table is the only one with class "wikitable sortable"
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable <- tables[which(tableclasses=="wikitable sortable")]$table
#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))
# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
  tablerow <- thetable$children[[i]]$children
  opponent <- tablerow[[1]]$children[[2]]$children$text$value
  others <- unname(sapply(tablerow[-1], function(x) x$children$text$value))
  content <- rbind(content, c(opponent, others))
}
# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)
Edited to add:
Sample output
Opponent Played Won Drawn Lost Goals for Goals against  % Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
...
rvest, along with xml2, is another popular package for parsing HTML web pages.
library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file<-read_html(theurl)
tables<-html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)
The syntax is easier to use than that of the XML package, and for most web pages rvest provides all of the options one needs.
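As a usage note, html_table() on a node set returns a list of data frames, so the table itself is the first element (a minimal sketch continuing the code above):
df <- table1[[1]]
head(df)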
Another option uses XPath:
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)
# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))
# Clean up the results
content[,1] <- gsub(" ", "", content[,1])
tablehead <- gsub(" ", "", tablehead)
names(content) <- tablehead
Produces this result
> head(content)
Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
4 Chile 64 45 12 7 147 53 70.3%
5 Peru 39 27 9 3 83 27 69.2%
6 Mexico 36 21 6 9 69 34 58.3%