Scraping HTML tables into R data frames using the XML package

How do I scrape HTML tables using the XML package?
Take, for example, this Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?

…or a shorter try:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
# the picked table is the longest one on the page
tables[[which.max(n.rows)]]
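If you want to keep the picked table around and give it usable column types (a small follow-up sketch; the object name brazil is just an example, and readHTMLTable typically leaves every column as character or factor):
brazil <- tables[[which.max(n.rows)]]
# coerce columns that look numeric; as.is = TRUE keeps the rest as character
brazil[] <- lapply(brazil, function(col) type.convert(as.character(col), as.is = TRUE))
str(brazil)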

library(RCurl)
library(XML)
# Download page using RCurl
# You may need to set proxy details, etc., in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})
# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]
#In this case, the required table is the only one with class "wikitable sortable"
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable <- tables[which(tableclasses=="wikitable sortable")]$table
# Get column headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))
# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
tablerow <- thetable$children[[i]]$children
opponent <- tablerow[[1]]$children[[2]]$children$text$value
others <- unname(sapply(tablerow[-1], function(x) x$children$text$value))
content <- rbind(content, c(opponent, others))
}
# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)
Edited to add:
Sample output
Opponent Played Won Drawn Lost Goals for Goals against  % Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
...

The rvest package, along with xml2, is another popular option for parsing HTML web pages.
library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)
The syntax is easier to use than the XML package, and for most web pages it provides all of the options one needs.
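If you already know which table you want, a CSS selector can pick it out directly. A minimal sketch, assuming the match table still carries the "wikitable sortable" class on the current version of the page:
library(rvest)
theurl <- "https://en.wikipedia.org/wiki/Brazil_national_football_team"
page <- read_html(theurl)
# html_node() returns the first node matching the CSS selector
brazil <- page %>%
  html_node("table.wikitable.sortable") %>%
  html_table(fill = TRUE)
head(brazil)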

Another option using XPath:
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)
# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))
# Clean up the results (strip the non-breaking spaces left over from the HTML)
content[,1] <- gsub("\u00a0", "", content[,1])
tablehead <- gsub("\u00a0", "", tablehead)
names(content) <- tablehead
Produces this result
> head(content)
Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
4 Chile 64 45 12 7 147 53 70.3%
5 Peru 39 27 9 3 83 27 69.2%
6 Mexico 36 21 6 9 69 34 58.3%
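A possible shortcut under the same parse: readHTMLTable is documented to accept a parsed document or an individual table node, so the row/column assembly above can be delegated to it (a sketch; some column cleanup would still be needed):
# hand the matched table node straight to readHTMLTable
node <- getNodeSet(pagetree, "//table[contains(@class, 'wikitable sortable')]")[[1]]
content2 <- readHTMLTable(node, stringsAsFactors = FALSE)
head(content2)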

Related

Scrape website's Power BI dashboard using R

I have been trying to scrape my local government's Power BI dashboard using R, but it seems like it might be impossible. I've read on the Microsoft site that it is not possible to scrape Power BI dashboards, but I keep coming across forum posts suggesting that it is, so I'm going around in circles.
I am trying to scrape the Zip Code tab data from this dashboard:
https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2
I've tried several "techniques", shown in the code below.
scc_webpage <- xml2::read_html("https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2")
# Attempt using xpath
scc_webpage %>%
rvest::html_nodes(xpath = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]') %>%
rvest::html_text()
# Attempt using div.<class>
scc_webpage %>%
rvest::html_nodes("div.pivotTableCellWrap cell-interactive tablixAlignRight ") %>%
rvest::html_text()
# Attempt using xpathSapply
query = '//*[@id="pvExplorationHost"]/div/div/exploration/div/explore-canvas-modern/div/div[2]/div/div[2]/div[2]/visual-container-repeat/visual-container-group/transform/div/div[2]/visual-container-modern[1]/transform/div/div[3]/div/visual-modern/div/div/div[2]/div[1]/div[4]/div/div/div[1]/div[1]'
XML::xpathSApply(xml, query, xmlValue)
scc_webpage %>%
html_nodes("ui-view")
But I always either get an output saying character(0) when using xpath and the div class and id, or {xml_nodeset (0)} when trying to go through html_nodes. The weird thing is that it wouldn't show the whole html of the table data when I do:
scc_webpage %>%
html_nodes("div")
And this would be the output, leaving the chunk that I needed blank:
{xml_nodeset (2)}
[1] <div id="pbi-loading"><svg version="1.1" class="pulsing-svg-item" xmlns="http://www.w3.org/2000/svg" xmlns:xlink ...
[2] <div id="pbiAppPlaceHolder">\r\n <ui-view></ui-view><root></root>\n</div>
I guess the issue may be because the numbers are within a series of nested div elements?
The main data I am trying to get are the numbers from the table showing the Zip code, confirmed cases, % total cases, deaths, % total deaths.
If this is possible to do in R or possibly in Python using Selenium, any help with this would be greatly appreciated!!
The problem is that the site you want to analyze relies on JavaScript to run and fetch the content for you. In such a case, httr::GET is of no help to you.
However, since manual work is also not an option, we have Selenium.
The following does what you're looking for:
library(dplyr)
library(purrr)
library(readr)
library(wdman)
library(RSelenium)
library(xml2)
library(selectr)
# using wdman to start a selenium server
selServ <- selenium(
port = 4444L,
version = 'latest',
chromever = '84.0.4147.30' # set this to a chrome version that's available on your machine
)
# using RSelenium to start chrome on the selenium server
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4444L,
browserName = 'chrome'
)
# open a new Tab on Chrome
remDr$open()
# navigate to the site you wish to analyze
report_url <- "https://app.powerbigov.us/view?r=eyJrIjoiZDFmN2ViMGEtNzQzMC00ZDU3LTkwZjUtOWU1N2RiZmJlOTYyIiwidCI6IjNiMTg1MTYzLTZjYTMtNDA2NS04NDAwLWNhNzJiM2Y3OWU2ZCJ9&pageName=ReportSectionb438b98829599a9276e2&pageName=ReportSectionb438b98829599a9276e2"
remDr$navigate(report_url)
# find and click the button leading to the Zip Code data
zipCodeBtn <- remDr$findElement('.//button[descendant::span[text()="Zip Code"]]', using="xpath")
zipCodeBtn$clickElement()
# fetch the site source in XML
zipcode_data_table <- read_html(remDr$getPageSource()[[1]]) %>%
querySelector("div.pivotTable")
Now we have the page source read into R, probably what you had in mind when you started your scraping task.
From here on it's smooth sailing and merely a matter of converting that XML to a usable table:
col_headers <- zipcode_data_table %>%
querySelectorAll("div.columnHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
rownames <- zipcode_data_table %>%
querySelectorAll("div.rowHeaders div.pivotTableCellWrap") %>%
map_chr(xml_text)
zipcode_data <- zipcode_data_table %>%
querySelectorAll("div.bodyCells div.pivotTableCellWrap") %>%
map(xml_parent) %>%
unique() %>%
map(~ .x %>% querySelectorAll("div.pivotTableCellWrap") %>% map_chr(xml_text)) %>%
setNames(col_headers) %>%
bind_cols()
# tadaa
df_final <- tibble(zipcode = rownames, zipcode_data) %>%
type_convert(trim_ws = T, na = c(""))
The resulting df looks like this:
> df_final
# A tibble: 15 x 5
zipcode `Confirmed Cases ` `% of Total Cases ` `Deaths ` `% of Total Deaths `
<chr> <dbl> <chr> <dbl> <chr>
1 63301 1549 17.53% 40 28.99%
2 63366 1364 15.44% 38 27.54%
3 63303 1160 13.13% 21 15.22%
4 63385 1091 12.35% 12 8.70%
5 63304 1046 11.84% 3 2.17%
6 63368 896 10.14% 12 8.70%
7 63367 882 9.98% 9 6.52%
8 534 6.04% 1 0.72%
9 63348 105 1.19% 0 0.00%
10 63341 84 0.95% 1 0.72%
11 63332 64 0.72% 0 0.00%
12 63373 25 0.28% 1 0.72%
13 63386 17 0.19% 0 0.00%
14 63357 13 0.15% 0 0.00%
15 63376 5 0.06% 0 0.00%
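One housekeeping step the example above leaves implicit: once you're done scraping, close the browser session and stop the Selenium server that wdman started (a short sketch using the objects created above):
# shut down the Chrome session and the background selenium server process
remDr$close()
selServ$stop()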

Scraping dynamic table in R

I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated to a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html()
So what I did was to access to the data loaded in my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)
# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)
tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))
html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
html_text() %>%
stri_split_lines() %>%
.[[1]] -> js_lines
idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=")+1
id <- substr(js_lines[idx],start, start+9)
tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=",id,"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"), options = "HUGE")
The object tab contains the information I need.
What I need to do now is to create a data frame associating each manager name with his or her education.
I could try to do this by transforming my object in a string, then extracting the characters following the word "Education".
Though, this looks extremely inefficient.
I was wondering if anyone can provide some guidance.
This thing really is a mess - nice work getting the links and downloading the info.
Poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
paste0(
"http://financials.morningstar.com/oprn/c-managers.action?&t=",
id,
"&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
)
txt <-
readLines(url, warn = FALSE)
json <-
txt %>%
gsub("^jsonp\\d+\\(", "", .) %>%
gsub("\\)$", "", .)
json_parsed <-
jsonlite::fromJSON(json)
html_clean <-
json_parsed$html %>%
gsub("\t", "", .)
html_parsed <-
read_html(html_clean)
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: If we have a node set (the thing you get when using html_nodes) we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further to get those parts of the CV entries where further information ("Other Assets Managed", "Education", ... etc.) is stored.
cvs <-
html_parsed %>%
html_nodes(xpath = "/html/body/table/tbody/tr[not(#align='left')]")
info_tmp <-
cvs %>%
html_nodes(xpath = "td/table/tbody")
Building up Data.Frame 1
There is a little problem with the table. Each CV entry lives in its own table row. For name, from, to and description there is always exactly one item per CV entry, but for "Other Assets Managed", "Education", ... etc. this is not true.
Therefore, information extraction is done in two parts.
df <-
cvs %>%
lapply(
FUN =
function(x){
tmp <-
x %>%
html_nodes(xpath = "th") %>%
html_text() %>%
gsub(" +", "", .)
data.frame(
name = stri_extract(tmp, regex = "[. \\w]+"),
from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
to = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
)
}
) %>%
do.call(rbind, .)
df$description <-
info_tmp %>%
html_nodes(xpath = "tr[1]/td[1]") %>%
html_text()
df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html node trickery ... If we use html_nodes() on the result set of html_nodes(), we get all matching nodes and none of the non-matching nodes. This is a problem since we might get 1, 0 or multiple nodes per node-set node, basically destroying any information about where those newly selected nodes came from.
There is a solution, however: we can use lapply to query each element of a node set independently from the others and thereby preserve information about the original structure.
extract_key_value_pairs <-
function(i, info_tmp){
cv_id <-
seq_along(info_tmp)
key <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[1]")) %>%
html_text()
if ( length(tmp) == 0 ) {
return("")
}else{
return(tmp)
}
}
)
value <-
lapply(
info_tmp,
function(x){
tmp <-
x %>%
html_nodes(xpath = paste0("tr[",i,"]/td[2]")) %>%
html_text() %>%
stri_trim_both() %>%
stri_split(fixed = "\n") %>%
lapply(X = ., stri_trim_both)
if ( length(tmp) == 0 ) {
return("")
}else{
return(unlist(tmp))
}
}
)
df <-
mapply(
cv_id = cv_id,
key = key,
value = value,
FUN =
function(cv_id, key, value){
data.frame(
cv_id = cv_id,
key = key,
value = value
)
},
SIMPLIFY = FALSE
) %>%
do.call(rbind, .)
df[df$key != "",]
}
df2 <-
lapply(
X = c(3, 5, 7),
FUN = extract_key_value_pairs,
info_tmp = info_tmp
) %>%
do.call(rbind, .)
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
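To get back to the original goal (one row per manager with his or her education), a possible last step is to keep only the Education rows of df2 and join them onto the names in df via cv_id; a sketch, assuming dplyr is available:
library(dplyr)
education <- df2 %>%
  filter(key == "Education") %>%
  # attach the manager name that belongs to each cv_id
  left_join(df[, c("cv_id", "name")], by = "cv_id") %>%
  select(name, education = value)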
I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text to a data frame.
Thought I'd share what I have though
# get the text
f <- xml_text(tab)
# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)
# find ones to remove
sort(table(split_f), decreasing = T)[1:5]
split_f <- split_f[split_f!="—"]
split_f <- split_f[split_f!=""]
# manually found where to split
keep <- split_f[2:108]
# text looks ok, but would need rules to extract the rows in to a data.frame
View(keep)

Trouble using rvest on nested tables

I'm having an issue trying to get Rankings from the Freeride World Tour website.
I tried first to get a CSS selector for rvest using SelectorGadget in Chrome but can only get the riders and their overall score. What I'm interested in is getting the points a rider scored in each heat. I'm new to web scraping and CSS/HTML, so please hang in there with me.
# Get the website url
url <- read_html("https://www.freerideworldtour.com/rankings-detailed?season=165&competition=2&discipline=38")
Download everything from the page,
(all_text <- url %>%
html_nodes("div") %>%
html_text())
then look for Kristofer Turdell's first score of 2500 pts. with grep("2500 pts.", all_text), but I find... nothing.
When I right-click the 2500 pts. on the website and select "Inspect" I can see that the html code for this section is:
<div class="field__item even">2500 pts.</div>
So I tried to use the div class:
url %>%
html_nodes(".field__item.even:) %>%
html_text()
This only returns the overall score for the participants (e.g. Kristofer Turdell 7870 pts.).
Next, I tried using the right-click option to copy the XPath from "Inspect".
url %>%
html_nodes(xpath = "//*[#id="page-content"]/div/div/div[2]/div/div/div/div[1]/div[2]/div/div/div[1]/div/div[4]/div/div/div") %>%
html_text()
I'm not having any luck on this so I'd really appreciate your help.
url %>%
  html_node("div.panel-second") %>%
  html_text() %>%
  # collapse runs of whitespace and newlines into a single ";" separator
  gsub("\\s*\\n+\\s*", ";", .) %>%
  # each "pts." marks the end of a row, so turn it into a line break
  gsub("pts.", "\n", .) %>%
  read.table(text = ., fill = TRUE, sep = ";", row.names = NULL) %>%
  # keep only the rider name and points columns, dropping incomplete rows
  subset(select = 3:4) %>%
  na.omit()
V3 V4
1 Kristofer Turdell 7870
2 Markus Eder 7320
3 Mickael Bimboes 6930
4 Loic Collomb-Patton 6660
5 Yann Rausis 6290
6 Berkeley Patterson 5860
7 Leo Slemett 5835
8 Ivan Malakhov 5800
9 Craig Murray 5705
10 Logan Pehota 5655
11 Reine Barkered 5470
12 Grifen Moller 4765
13 Sam Lee 4580
14 Ryan Faye 3210
15 Conor Pelton 3185
16 George Rodney 3115
17 Taisuke Kusunoki 3060
18 Trace Cooke 2905
19 Aymar Navarro 2855
20 Felix Wiemers 2655
21 Fabio Studer 2305
22 Stefan Hausl 2240
23 Drew Tabke 1880
24 Carl Regnér Eriksson 1310
Writing that much code in the comments was awful, so here goes. You can store the scraped data into a dataframe and not be limited to printing it to the console:
library(tidyverse)
library(magrittr)
library(rvest)
url_base <- "https://www.freerideworldtour.com/rider/"
riders <- c("kristofer-turdell", "markus-eder", "mickael-bimboes")
output <- data_frame()
for (i in riders) {
temp <- read_html(paste0(url_base, i)) %>%
html_node("div") %>%
html_text() %>%
gsub("\\s*\\n+\\s*", ";", .) %>%
gsub("pts.", "\n", .) %>%
read.table(text = ., fill = T, sep = ";", row.names = NULL,
col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
subset(select = 2:5) %>%
dplyr::filter(
!is.na(as.numeric(as.character(Ranking))) &
as.character(Points) != ""
) %>%
dplyr::mutate(name = i)
output <- bind_rows(output, temp)
}
I put in parts such as as.character(Points) != "" to exclude the sum of points (such as Mickael Bimboes' 6930 pts) and keep only the individual scores.
Again, much credit goes to @Onyambu; many lines are borrowed from his answer.
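A possible refactor under the same assumptions about the page layout: wrap the per-rider pipeline in a function and let purrr::map_dfr() do the row-binding instead of growing output inside the for loop.
# hypothetical helper: scrape one rider page and return its cleaned table
scrape_rider <- function(i) {
  read_html(paste0(url_base, i)) %>%
    html_node("div") %>%
    html_text() %>%
    gsub("\\s*\\n+\\s*", ";", .) %>%
    gsub("pts.", "\n", .) %>%
    read.table(text = ., fill = TRUE, sep = ";", row.names = NULL,
               col.names = c("Drop", "Ranking", "FWT", "Events", "Points")) %>%
    subset(select = 2:5) %>%
    dplyr::filter(
      !is.na(as.numeric(as.character(Ranking))) &
        as.character(Points) != ""
    ) %>%
    dplyr::mutate(name = i)
}
output <- purrr::map_dfr(riders, scrape_rider)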

Parsing JSON URL in R with different number of fields

I'm having a lot of trouble trying to read some JSON data obtained from a URL in R. I'm able to read in the data and call on each observation to get the values (as characters, which is fine), but I can't seem to find a way to get the data into a table format (basically like in Excel).
I've tried to create a loop which calls on each field to place it in an empty matrix; however, not every object has the same number of fields (i.e. some values have Label1 and Label2, while others just have Label1). I get the error that the subscripts are out of bounds. What I was thinking was to make a conditional statement whereby, if the field existed, the value of the field would be put in the data matrix, and if the field does not exist I would insert an NA. I get a subscript error automatically though and cannot do the conditional evaluation - I've looked to see if I can coerce an error to become an NA, but I don't think this is possible.
I'm starting the index from j=3, since the first two observations in the JSON code are not needed for me. My problem is that for example "json$poi[[j]]$label[[2]]$value" may not exist for every observation and I automatically get an error when the code comes across the first observation missing this field.
The data is quite big - around 4480 observations with up to 20 fields each. I only require the 9 fields I have listed, however. Here is a link to the data URL - it may take some time to load. I'm quite new to coding, and especially to dealing with JSON files, so my apologies if this has a simple solution that I'm not seeing.
Thanks!
http://tourism.citysdk.cm-lisboa.pt/pois/?limit=-1
library(rjson)
library(RCurl)
json <- fromJSON(getURL('http://tourism.citysdk.cm-lisboa.pt/pois/?limit=-1'))
ljson <- length(json$poi)-2
data <- matrix(data=NA, nrow=ljson, ncol=9)
for(i in 1:ljson)
{
j <- i+2
d1 <- json$poi[[j]]$location$point[[1]]$Point$posList
d2 <- json$poi[[j]]$label[[1]]$value
d3 <- json$poi[[j]]$label[[2]]$value
d4 <- json$poi[[j]]$category[[1]]$value
d5 <- json$poi[[j]]$category[[2]]$value
d6 <- json$poi[[j]]$id
d7 <- json$poi[[j]]$author$value
d8 <- json$poi[[j]]$license$value
d9 <- json$poi[[j]]$description[[1]]$value
if(exists("d1") == TRUE){
d1 <- json$poi[[j]]$location$point[[1]]$Point$posList
} else {
d1 <- NA
}
if(exists("d2") == TRUE){
d2 <- json$poi[[j]]$label[[1]]$value
} else {
d2 <- NA
}
if(exists("d3") == TRUE){
d3 <- json$poi[[j]]$label[[2]]$value
} else {
d3 <- NA
}
if(exists("d4") == TRUE){
d4 <- json$poi[[j]]$category[[1]]$value
} else {
d4 <- NA
}
if(exists("d5") == TRUE){
d5 <- json$poi[[j]]$category[[2]]$value
} else {
d5 <- NA
}
if(exists("d6") == TRUE){
d6 <- json$poi[[j]]$id
} else {
d6 <- NA
}
if(exists("d7") == TRUE){
d7 <- json$poi[[j]]$author$value
} else {
d7 <- NA
}
if(exists("d8") == TRUE){
d8 <- json$poi[[j]]$license$value
} else {
d8 <- NA
}
if(exists("d9") == TRUE){
d9 <- json$poi[[j]]$description[[1]]$value
} else {
d9 <- NA
}
data[i,] <- rbind(c(d1,d2,d3,d4,d5,d6,d7,d8,d9))
}
For JSON & XML list structures, str is your friend! You can use it to inspect all or portions of a list structure. Using sapply on individual components to extract values is probably better than the for construct, and you'll need to handle NULLs and missing sub-structure components to build a data frame from that JSON (and many JSON files, actually). The following gets you started, but you still have some work to do:
# simplify extraction (saves typing, too)
poi <- json$poi
# start at 3rd element
poi <- poi[3:length(poi)]
# have to do some special checking since the value isn't always there
poi_points <- sapply(poi, function(x) {
if ("point" %in% names(x$location) & length(x$location$point) > 0) {
x$location$point[[1]]$Point$posList
} else {
NA
}
})
# this removes NULLs which the data.frame call won't like later
poi_description <- sapply(poi, function(x) {
if (is.null(x$description[[1]]$value)) {
NA
} else {
x$description[[1]]$value
}
})
# this removes NULLs which the data.frame call won't like later
poi_category <- sapply(poi, function(x) {
if (is.null(x$category[[1]]$value)) {
NA
} else {
x$category[[1]]$value
}
})
# simpler extractions
poi_label <- sapply(poi, function(x) x$label[[1]]$value)
poi_id <- sapply(poi, function(x) x$id)
poi_author <- sapply(poi, function(x) x$author$value)
poi_license <- sapply(poi, function(x) x$license$value)
# make a data frame
poi <- data.frame(poi_label, poi_category, poi_id, poi_points, poi_author, poi_license, poi_description)
str(poi)
## 'data.frame': 4482 obs. of 7 variables:
## $ poi_label : Factor w/ 4482 levels "\"Bloco das Águas Livres\", edifício de habitação, comércio e serviços",..: 363 765 764 1068 174 419 461 762 420 412 ...
## $ poi_category : Factor w/ 129 levels "Acessórios de Uso Pessoal",..: 33 33 33 33 33 33 123 33 33 33 ...
## $ poi_id : Factor w/ 4482 levels "52d7bf4d723e8e0b0cc08b69",..: 2 3 4 5 7 8 15 16 17 18 ...
## $ poi_points : Factor w/ 3634 levels "38.405892 -9.93503",..: 975 244 478 416 301 541 2936 2975 2850 2830 ...
## $ poi_author : Factor w/ 1 level "CitySDK": 1 1 1 1 1 1 1 1 1 1 ...
## $ poi_license : Factor w/ 1 level "open-data": 1 1 1 1 1 1 1 1 1 1 ...
## $ poi_description: Factor w/ 2831 levels "","\n","\n\n",..: 96 1051 NA NA 777 1902 NA 1038 81 82 ...
##
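All the columns come back as factors above because data.frame() defaulted to stringsAsFactors = TRUE before R 4.0; if you would rather keep character columns, a small tweak to the same call:
# same data.frame() call as above, but keep the text columns as character
poi <- data.frame(poi_label, poi_category, poi_id, poi_points, poi_author,
                  poi_license, poi_description, stringsAsFactors = FALSE)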

Community detection with bipartite graph in igraph

I have a bipartite list (posts, word categories) with 1000 vertices and want to use the fast-greedy algorithm for community detection, but I am not sure if I have to run it on the bipartite graph or the bipartite projection.
My bipartite list looks like this:
post word
1 66 2
2 312 1
3 432 7
4 433 7
5 434 1
6 435 5
7 436 1
8 437 4
When I run it without a projection I have problems clustering in the second step:
### Load bipartie list and create graph ###
bipartite_list <- read.csv("bipartite_list_tnf.csv", header = TRUE, sep = ";")
bipartite_graph <- graph.incidence(bipartite_list)
g<-bipartite_graph
fc <- fastgreedy.community(g) ## communities / clusters
set.seed(123)
l <- layout.fruchterman.reingold(g, niter=1000, coolexp=0.5) ## layout
membership(fc)
# 2. checking who is in each cluster
cl <- data.frame(name = fc$post, cluster = fc$membership, stringsAsFactors=F)
cl <- cl[order(cl$cluster),]
cl[cl$cluster==1,]
# 3. preparing data for plot
d <- data.frame(l); names(d) <- c("x", "y")
d$cluster <- factor(fc$membership)
# 4. plot with only nodes, colored by cluster
p <- ggplot(d, aes(x=x, y=y, color=cluster))
pq <- p + geom_point()
pq
Maybe I have to run the community detection on a projection? But then I always get a failure because the projection is not a graph object:
bipartite_graph <- graph.incidence(bipartite_list)
#projection (both directions)
projection_word_post <- bipartite.projection(bipartite_graph)
fc <- fastgreedy.community(projection_word_post)
Error in fastgreedy.community(projection_word_post) : Not a graph object
I would be glad for help!
When you run it without the projection, the issue is at:
bipartite_graph <- graph.incidence(bipartite_list)
You need to reshape bipartite_list into an incidence table before passing it to graph.incidence(). Use the command below:
tab <- table(bipartite_list)
and the rest of the steps are the same:
g <- graph.incidence(tab,mode=c("all"))
fc <- fastgreedy.community(g)
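If you do want to cluster a projection instead, note that bipartite.projection() returns a list holding the two one-mode graphs rather than a single graph object, which is why fastgreedy.community() complained above. A sketch picking one side of the projection:
proj <- bipartite.projection(g)
# proj$proj1 and proj$proj2 are the two one-mode graphs (one per vertex type)
fc_posts <- fastgreedy.community(proj$proj1)
membership(fc_posts)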