R loops with JSON API

I'm trying to get book price data from an API (http://www.knigoed.info/api-prices.html) based on ISBN.
The idea is to submit a vector of ISBNs to a function and get back a data frame with all available info (or at least a data frame with prices from different vendors):
library(jsonlite)
library(dplyr)

isbns <- c("9785170922789", "9785170804801", "9785699834174", "9785699717255", "9785170869237")

getISBNprice <- function(ISBN, source="http://www.knigoed.info/api/Prices?code=") {
  pathA <- source
  for (i in 1:length(ISBN)) {
    ISB <- ISBN[i]
    AAA <- paste(pathA, ISB, "&sortPrice=DESC&country=RU", sep="")
    document <- fromJSON(AAA, flatten = FALSE)
    dfp <- document$prices
    dfp <- cbind(dfp, ISB)
    # dfp <- cbind(dfp, BookID=document$bookId)
    # dfp <- cbind(dfp, Title=document$title)
    # dfp <- cbind(dfp, Author=document$author)
    # dfp <- cbind(dfp, Publisher=document$publisher)
    # dfp <- cbind(dfp, Series=document$series)
    # dfp <- cbind(dfp, Picture=document$picture)
    if (!exists("AAAA")) {AAAA <- dfp} else {bind_rows(AAAA, dfp)}
  }
  AAAA
}
But the function returns warnings:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
3: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
4: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
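(Worth noting, though it is not what the warnings complain about: the result of bind_rows(AAAA, dfp) is never assigned back to AAAA, so the loop as written only ever keeps the first ISBN's rows.)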

It's easiest to make a list from the start, which will make simplifying later easier. The purrr package can make working with lists much easier, though the usages here can be replaced with base's lapply and mapply/Map if you prefer.
library(purrr)

# Paste is vectorized, so make a list of URLs all at once.
# `httr` can make a URL out of a list of named parameters, if it's more convenient.
results <- paste0("http://www.knigoed.info/api/Prices?code=",
                  isbns,
                  "&sortPrice=DESC&country=RU") %>%
  # Iterate over vector of URLs, using fromJSON to pull and parse the request.
  # map, like lapply, will put the results into a list.
  map(jsonlite::fromJSON, flatten = FALSE)

# Grab "prices" element of each top-level list element
results %>% map('prices') %>%
  # Iterate in parallel (like mapply/Map) over prices and isbns, making a data.frame of
  # each. map2_df will coerce the resulting list of data.frames to a single data.frame.
  map2_df(isbns, ~data.frame(isbn = .y, .x, stringsAsFactors = FALSE)) %>%
  # For pretty printing
  tibble::as_data_frame()
## # A tibble: 36 x 10
##             isbn shopId       name      domain
##            <chr>  <chr>      <chr>       <chr>
## 1  9785170922789     29    Магистр    booka.ru
## 2  9785170922789      3   Лабиринт labirint.ru
## 3  9785170922789     20  LitRes.ru   litres.ru
## 4  9785170804801     29    Магистр    booka.ru
## 5  9785170804801      2    Read.ru     read.ru
## 6  9785170804801      3   Лабиринт labirint.ru
## 7  9785170804801     63      Эксмо    eksmo.ru
## 8  9785170804801      1    OZON.ru     ozon.ru
## 9  9785170804801      4 My-shop.ru  my-shop.ru
## 10 9785170804801      1    OZON.ru     ozon.ru
## # ... with 26 more rows, and 6 more variables: url <chr>, available <lgl>, downloadable <lgl>,
## #   priceValue <dbl>, priceSuffix <chr>, year <int>
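For reference, here is a rough base-R equivalent of the approach above, using lapply and Map as mentioned; a sketch only, not tested against the live API:

urls <- paste0("http://www.knigoed.info/api/Prices?code=", isbns,
               "&sortPrice=DESC&country=RU")
# One parsed document per URL
docs <- lapply(urls, jsonlite::fromJSON, flatten = FALSE)
# Pull the "prices" element of each document
prices <- lapply(docs, `[[`, "prices")
# Pair each prices data.frame with its ISBN, then stack the pieces.
# Note: unlike bind_rows, do.call(rbind, ...) requires identical columns
# in every piece, which is exactly what map2_df papers over.
rows <- Map(function(p, isbn) data.frame(isbn = isbn, p, stringsAsFactors = FALSE),
            prices, isbns)
do.call(rbind, rows)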


How can I filter out numbers from an html table in R?

I am currently working on a forecasting model, and to do this I would like to import data from an HTML website into R and save the values part of the data set in a new list.
I have used the following approach in R:
library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document, "//*/comment()"))
doc.tables <- readHTMLTable(document)

# show BID/ASK block:
doc.tables[2]
Which (doc.tables[2]) in this case gives me the result:
$`NULL`
  Bid 0,765
1 Ask  0,80
How can I filter the numbers (0,765 and 0,80) out of the table and save them in a list?
The issue is that 0,765 is actually the name of your data frame column, your data frame being doc.tables[[2]].
You can grab the name by calling names(doc.tables[[2]])[2] and store it as a variable: name <- names(doc.tables[[2]])[2].
Then you can grab the 0,80 by using doc.tables[[2]][[2]], and store that as a variable too if you like.
Final code should look like: my_list <- list(name, doc.tables[[2]][[2]])
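Assembled into one runnable chunk (a sketch based on the steps above and the output shown in the question):

# The Bid value is stored as the column name, the Ask value as the column's content
name  <- names(doc.tables[[2]])[2]   # "0,765"
value <- doc.tables[[2]][[2]]        # "0,80"
my_list <- list(name, value)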
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)

tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)

page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"),
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))

# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)

# Extract the second column from the BID/ASK table
BidAsk <- doc.tables[[2]][, 2]

# Replace the comma decimal separator with a point and convert to numeric
BidAsk <- as.numeric(gsub(",", ".", BidAsk))
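Assuming the page still serves the same Bid/Ask block as in the question, the result would be:

BidAsk
#> [1] 0.765 0.800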

How to export an R object to HTML file without writing first to .Rmd?

I want to print the content of R objects and save it as a rendered HTML file. To this end, I find the pander package useful. However, I don't know how to go from the markdown pander::pander() generates to an HTML file in a single code execution, without intermediate save to .Rmd file. This question comes after a different question I posted about getting a png export for the same process.
I'm going to use the same example I used in the other post.
Example
Let's say that we have the mtcars data, and we want to extract some information about it:
Number of rows in the data
Average of mpg
The factor levels available in cyl
Regression summary for predicting mpg ~ cyl
To this end, I'll compute each of the above and assign them to objects. Finally, I'll bundle all the info in a list object.
library(dplyr)
library(tibble)
library(broom)

number_of_rows <- nrow(mtcars)
mpg_mean <- mean(mtcars$mpg)
cyl_levels <- mtcars %>% select(cyl) %>% unique() %>% remove_rownames()
model_summary <- lm(mpg ~ cyl, mtcars) %>% broom::tidy()

my_data_summary <- lst(number_of_rows,
                       mpg_mean,
                       cyl_levels,
                       model_summary)
> my_data_summary
## $number_of_rows
## [1] 32
##
## $mpg_mean
## [1] 20.09062
##
## $cyl_levels
##   cyl
## 1   6
## 2   4
## 3   8
##
## $model_summary
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    37.9      2.07      18.3  8.37e-18
## 2 cyl            -2.88     0.322     -8.92 6.11e-10
So how can I print the contents of my_data_summary as an HTML file?
I thought that pander::pander() is a good candidate because it accepts the list object in its entirety.
library(pander)
> pander(my_data_summary)
* **number_of_rows**: _32_
* **mpg_mean**: _20.09_
* **cyl_levels**:

    -----
     cyl
    -----
      6
      4
      8
    -----

* **model_summary**:

    ------------------------------------------------------------
        term       estimate   std.error    statistic    p.value
    ------------- ---------- ----------- ----------- -----------
     (Intercept)    37.88       2.074       18.27      8.369e-18
         cyl       -2.876      0.3224      -8.92       6.113e-10
    ------------------------------------------------------------

<!-- end of list -->
But now I'm stuck. How could I progress from pander(my_data_summary) to an .html file in my directory? I want an .html that when opened, looks like this:
Is there a way to write a single code that goes directly from my_data_summary to an .html export without generating an intermediate file?
Thanks!
EDIT 1
@Konrad commented that it's impossible to progress from pander's markdown output directly. So if we ignore pander, is there any way to export the contents of my_data_summary to an HTML file directly?
EDIT 2
I made some progress with pander.
info_as_chr_vec <-
  my_data_summary %>%
  pander::pander_return() %>%
  paste(., collapse = "\n")

info_as_chr_vec %>%
  pander::Pandoc.brew(text = ., output = "path/to/html/output/file", convert = "html")
This writes an HTML file, and the result seems OK. But I dislike the unnecessary table of contents and the footer, nor do I know how to change the overall style of the tables; I'd much rather have something more plain. Obviously, I could edit the HTML code after the fact, but the whole point is to get the desired HTML file in the original execution of the R code.
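For what it's worth, one plainer route (a sketch of an alternative, not something from the original post) is to hand the pander markdown to markdown::markdownToHTML(), which should write bare HTML without the table of contents and footer that Pandoc.brew adds:

my_data_summary %>%
  pander::pander_return() %>%
  paste(collapse = "\n") %>%
  # text = . passes the markdown string directly; output names the file to write
  markdown::markdownToHTML(text = ., output = "my_summary.html")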

Scraping dynamic table in R

I am stuck on a simple web scrape.
My goal is to scrape Morningstar.com to retrieve the education of the managers associated to a fund name.
First off, let me say that I am not familiar at all with this operation. However, I did my best to provide some code.
For example, consider the following webpage
http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US
The problem is that the page dynamically loads the section I am targeting, so it doesn't actually get pulled in by read_html().
So what I did was access the data loaded into my section of interest.
Specifically, I did:
# edit: added packages required
library(xml2)
library(rvest)
library(stringi)

# original code
tmp_url <- "http://financials.morningstar.com/fund/management.html?t=AALGX&region=usa&culture=en_US"
pg <- read_html(tmp_url)

tmp <- length(html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]"))

html_nodes(pg, xpath=".//script[contains(., 'function loadManagerInfo()')]") %>%
  html_text() %>%
  stri_split_lines() %>%
  .[[1]] -> js_lines

idx <- which(stri_detect_fixed(js_lines, '\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t='))
start <- nchar("\t\t\"//financials.morningstar.com/oprn/c-managers.action?&t=") + 1
id <- substr(js_lines[idx], start, start + 9)

tab <- read_html(paste0("http://financials.morningstar.com/oprn/c-managers.action?&t=", id,
                        "&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"),
                 options = "HUGE")
The object tab contains the information I need.
What I need to do now is create a data frame associating each manager name with his or her education.
I could try to do this by transforming my object into a string, then extracting the characters following the word "Education".
Though, this looks extremely inefficient.
I was wondering if anyone can provide some guidance.
This thing really is a mess, but nice work getting the links and downloading the info.
After poking around a lot and taking various detours, this is the best I could come up with:
Clean Up
First there is some cleanup to do. Instead of directly downloading and parsing the document in one step we will:
download the document as text
clean up the text a little to get the JSON
parse the JSON
extract the HTML item
do some further cleaning
finally parse the HTML
url <-
  paste0(
    "http://financials.morningstar.com/oprn/c-managers.action?&t=",
    id,
    "&region=usa&culture=en-US&cur=&callback=jsonp1523529017966&_=1523529019244"
  )

# download the document as text
txt <- readLines(url, warn = FALSE)

# strip the JSONP wrapper to get at the JSON itself
json <-
  txt %>%
  gsub("^jsonp\\d+\\(", "", .) %>%
  gsub("\\)$", "", .)

# parse the JSON
json_parsed <- jsonlite::fromJSON(json)

# extract the HTML item and do some further cleaning
html_clean <-
  json_parsed$html %>%
  gsub("\t", "", .)

# finally parse the HTML
html_parsed <- read_html(html_clean)
First Round of Node Extraction
Next we use some black magic node extraction trickery. Basically the trick goes like this: If we have a node set (the thing you get when using html_nodes) we can use further XPath queries to drill down.
The first node set (cvs) captures the basic path to the CV entries in the table.
The second node set (info_tmp) drills down a little further, to those parts of the CV entries where further information ("Other Assets Managed", "Education", etc.) is stored.
cvs <-
  html_parsed %>%
  html_nodes(xpath = "/html/body/table/tbody/tr[not(@align='left')]")

info_tmp <-
  cvs %>%
  html_nodes(xpath = "td/table/tbody")
Building Up Data.Frame 1
There is a little problem with the table: each CV entry lives in its own table row, and while there is always exactly one name, from, to and description item per CV entry, this is not true for "Other Assets Managed", "Education", etc.
Therefore, information extraction is done in two parts.
df <-
  cvs %>%
  lapply(
    FUN = function(x){
      tmp <-
        x %>%
        html_nodes(xpath = "th") %>%
        html_text() %>%
        gsub(" +", "", .)
      data.frame(
        name = stri_extract(tmp, regex = "[. \\w]+"),
        from = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}"),
        to   = stri_extract(tmp, regex = "\\d{2}/\\d{2}/\\d{4}")
      )
    }
  ) %>%
  do.call(rbind, .)

df$description <-
  info_tmp %>%
  html_nodes(xpath = "tr[1]/td[1]") %>%
  html_text()

df$cv_id <- seq_len(nrow(df))
Building Up Data.Frame 2
Now some more html_nodes() trickery ... If we simply call html_nodes() on the result set of a previous html_nodes() call, we get all matching nodes pooled together. This is a problem, since we might get 1, 0 or multiple matches per node-set node, basically destroying any information about where those newly selected nodes came from.
There is a solution, however: we can use lapply to query each element of a node set independently of the others, thereby preserving information about the original structure.
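To make the idea concrete before the full function below, a minimal sketch (the xpath is chosen just for illustration):

# One query per node keeps a 1:1 mapping with the original node set;
# nodes with no match yield an empty result instead of silently vanishing.
per_node <- lapply(info_tmp, function(x) html_nodes(x, xpath = "tr[3]/td[1]"))
sapply(per_node, length)  # 0 where a CV entry lacks that row, 1 where it has it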
extract_key_value_pairs <- function(i, info_tmp){
  cv_id <- seq_along(info_tmp)
  key <-
    lapply(
      info_tmp,
      function(x){
        tmp <-
          x %>%
          html_nodes(xpath = paste0("tr[", i, "]/td[1]")) %>%
          html_text()
        if ( length(tmp) == 0 ) {
          return("")
        } else {
          return(tmp)
        }
      }
    )
  value <-
    lapply(
      info_tmp,
      function(x){
        tmp <-
          x %>%
          html_nodes(xpath = paste0("tr[", i, "]/td[2]")) %>%
          html_text() %>%
          stri_trim_both() %>%
          stri_split(fixed = "\n") %>%
          lapply(X = ., stri_trim_both)
        if ( length(tmp) == 0 ) {
          return("")
        } else {
          return(unlist(tmp))
        }
      }
    )
  df <-
    mapply(
      cv_id = cv_id,
      key = key,
      value = value,
      FUN = function(cv_id, key, value){
        data.frame(
          cv_id = cv_id,
          key = key,
          value = value
        )
      },
      SIMPLIFY = FALSE
    ) %>%
    do.call(rbind, .)
  df[df$key != "", ]
}

df2 <-
  lapply(
    X = c(3, 5, 7),
    FUN = extract_key_value_pairs,
    info_tmp = info_tmp
  ) %>%
  do.call(rbind, .)
Results
df
## name from to description cv_id
## 1 Kurt J. Lauber 03/20/2013 03/20/2013 Mr. Lauber ... 1
## 2 Noah J. Monsen 02/28/2018 02/28/2018 Mr. Monsen ... 2
## 3 Lauri Brunner 09/30/2018 09/30/2018 Ms. Brunne ... 3
## 4 Darren M. Bagwell 02/29/2016 02/29/2016 Darren M. ... 4
## 5 David C. Francis 10/07/2011 10/07/2011 Francis is ... 5
## 6 Michael A. Binger 04/14/2010 04/14/2010 Binger has ... 6
## 7 David E. Heupel 04/14/2010 04/14/2010 Mr. Heupel ... 7
## 8 Matthew D. Finn 03/30/2007 03/30/2007 Mr. Finn h ... 8
## 9 Scott Vergin 03/30/2007 03/30/2007 Vergin has ... 9
## 10 Frederick L. Plautz 11/01/1995 11/01/1995 Plautz has ... 10
## 11 Clyde E. Bartter 01/01/1994 01/01/1994 Bartter is ... 11
## 12 Wayne C. Stevens 01/01/1994 01/01/1994 Stevens is ... 12
## 13 Julian C. Ball 07/16/1987 07/16/1987 Ball is a ... 13
df2
## cv_id key value
## 1 Other Assets Managed
## 2 Other Assets Managed
## 3 Other Assets Managed
## 4 Certification CFA
## 4 Other Assets Managed
## 5 Certification CFA
## 5 Education M.B.A. University of Pittsburgh, 1978
## 5 Education B.A. University of Pittsburgh, 1977
## 5 Other Assets Managed
## 6 Certification CFA
## 6 Education M.B.A. University of Minnesota, 1991
## 6 Education B.S. University of Minnesota, 1987
## 6 Other Assets Managed
## 7 Other Assets Managed
## 8 Certification CFA
## 8 Education B.A. University of Pennsylvania, 1984
## 8 Education M.B.A. University of Michigan, 1990
## 8 Other Assets Managed
## 9 Certification CFA
## 9 Education M.B.A. University of Minnesota, 1980
## 9 Education B.A. St. Olaf College, 1976
## 9 Other Assets Managed
## 10 Education M.S. University of Wisconsin, 1981
## 10 Education B.B.A. University of Wisconsin, 1979
## 10 Other Assets Managed
## 11 Certification CFA
## 11 Education M.B.A. Western Reserve University, 1964
## 11 Education B.A. Baldwin-Wallace College, 1953
## 11 Other Assets Managed
## 12 Certification CFA
## 12 Education M.B.A. University of Wisconsin,
## 12 Education B.B.A. University of Miami,
## 12 Other Assets Managed
## 13 Certification CFA
## 13 Education B.A. Kent State University, 1974
## 13 Education J.D. Cleveland State University, 1984
## 13 Other Assets Managed
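As a possible final step (my addition, not part of the original answer), the Education rows of df2 can be joined back onto the manager names in df, which is what the question was ultimately after:

# keep only the Education rows and attach the matching manager names
education <- merge(df[, c("cv_id", "name")],
                   df2[df2$key == "Education", c("cv_id", "value")],
                   by = "cv_id")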
I don't have a solution, as this is not an area I have worked with before. However, with brute force you can probably get the table, assuming you have a list of rules that can parse the text into a data frame. Thought I'd share what I have, though:
# get the text
f <- xml_text(tab)

# split up, this bit is tricky..
split_f <- strsplit(f, split="\\\\t", perl=TRUE)[[1]]
split_f <- strsplit(split_f, split="\\\\n", perl=TRUE)
split_f <- unlist(split_f)
split_f <- trimws(split_f)

# find ones to remove
sort(table(split_f), decreasing = TRUE)[1:5]
split_f <- split_f[split_f != "—"]
split_f <- split_f[split_f != ""]

# manually found where to split
keep <- split_f[2:108]

# text looks ok, but would need rules to extract the rows into a data.frame
View(keep)

Extracting JSON-data from CSV file

I'm trying to extract JSON data that form a column in a CSV file. So far I've gotten to the point where I've extracted the column in the right format, but the formatting is only correct when the variable type is factor, and I can't convert a factor to JSON using the jsonlite package.
[1] {"id":509746197991998767,"visibility":{"percentage":100,"time":149797,"visible1":true,"visible2":false,"visible3":false,"activetab":true},"interaction":{"mouseovercount":1,"mouseovertime":1426,"videoplaytime":0,"engagementtime":0,"expandtime":0,"exposuretime":35192}}
Another approach is to use stringsAsFactors = FALSE when importing, but then I'm struggling to get the formatting right, where each entry looks like this:
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Am I missing something obvious here? I simply want to extract the JSON that sits inside a CSV file.
Here's a small example of the CSV file:
"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"
df <- readr::read_csv('"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"',
col_types = "-cc")
Using jsonlite::fromJSON on each separate value, then tidyr::unnest:
library(dplyr)

f <- function(.x)
  if (is.na(.x) || .x == "") data.frame()[1, ] else
    as.data.frame(jsonlite::fromJSON(.x))

df %>%
  tidyr::unnest(UnloadVars = lapply(UnloadVars, f)) %>%
  mutate_at(vars(ends_with("id")), as.character)
# A tibble: 6 x 16
# CookieID id visibility.percentage visibility.time visibility.visible1 visibility.visible2 visibility.visible3 visibility.activetab interaction.mouseovercount interaction.mouseovertime interaction.videoplaytime interaction.engagementtime interaction.expandtime interaction.exposuretime clicks.x clicks.y
# <chr> <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 -8857188784608690176 509746197991998784 100 149797 TRUE FALSE FALSE TRUE 1 1426 0 0 0 35192 NA NA
# 2 -1695626857458244096 2917654329769114112 46 0 FALSE FALSE FALSE TRUE NA NA NA NA NA NA NA NA
# 3 437299165071669184 2252707957388071936 99 10168 TRUE FALSE FALSE TRUE 0 0 0 0 0 542 105 449
# 4 292660729552227520 <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
# 6 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
I used readr::read_csv to read in your sample data set.
> df <- readr::read_csv('~/sample.csv')
Parsed with column specification:
cols(
  CookieID = col_double(),
  UnloadVars = col_character()
)
As you can see, UnloadVars is read in as character and not factor. If I now examine the first value in the UnloadVars column, I see the following, which matches what you get:
> df$UnloadVars[1]
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Now, I use jsonlite::fromJSON,
> j <- jsonlite::fromJSON(df$UnloadVars[1])
> j
$id
[1] 5.097462e+17

$visibility
$visibility$percentage
[1] 100

$visibility$time
[1] 149797

$visibility$visible1
[1] TRUE

$visibility$visible2
[1] FALSE

$visibility$visible3
[1] FALSE

$visibility$activetab
[1] TRUE

$interaction
$interaction$mouseovercount
[1] 1

$interaction$mouseovertime
[1] 1426

$interaction$videoplaytime
[1] 0

$interaction$engagementtime
[1] 0

$interaction$expandtime
[1] 0

$interaction$exposuretime
[1] 35192
Which I believe is what you need, since JSONs are parsed as lists in R.
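Once parsed, the nested values are ordinary list elements, so they can be reached with standard list indexing, for example:

j$visibility$percentage     # 100
j$interaction$exposuretime  # 35192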
It can be very tricky to deal with JSON data. As a general guideline, you should always strive to have your data in a data frame, but this is not always possible. In this specific case, I don't see a way to have both visibility and interaction values at once in a nicely formatted data frame.
What I will do next is extract the information from interaction into a data frame.
Load required packages and read the data
library(purrr)
library(dplyr)
library(tidyr)
df <- read.csv("sample.csv", stringsAsFactors = FALSE)
Then remove invalid JSON:
# remove rows without JSON (in this case, the 4th row)
df <- df %>%
dplyr::filter(UnloadVars != "")
Transform each JSON string into a list and put it into the UnloadVars column. In case you didn't know, it is possible to have a list column in a data frame, which can be very useful.
out <- data_frame(CookieID = numeric(),
                  UnloadVars = list())

for (row in 1:nrow(df)) {
  new_row <- data_frame(CookieID = df[row, ]$CookieID,
                        UnloadVars = list(jsonlite::fromJSON(df[row, ]$UnloadVars)))
  out <- bind_rows(out, new_row)
}

out
We can now extract the IDs from the lists in UnloadVars. This is straightforward because there is only one id per list.
out <- out %>%
  mutate(id = map_chr(UnloadVars, ~ .$id))
This final part can seem a bit intimidating. But what I am doing here is taking the interaction part from the UnloadVars column and putting it into an interaction column. I then transform each row of interaction, which is a list, into a data frame with two columns: key, which contains the name of the interaction metric, and value, its value. I finally unnest it, so we get rid of list columns and end up with a nicely formatted data frame.
unpack_list <- function(obj) {
  as.data.frame(obj) %>%
    gather(key, value)
}

df_interaction <- out %>%
  mutate(interaction = map(UnloadVars, ~ .$interaction)) %>%
  mutate(interaction = map(interaction, unpack_list)) %>%
  unnest(interaction)

df_interaction
The solution is not very elegant, but it gets the job done. You could apply the same logic to extract information from visibility, as sketched below.
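For completeness, a sketch of that same logic applied to visibility, reusing out and unpack_list from above (note that gather will coerce the mixed logical/numeric visibility values to a common type):

df_visibility <- out %>%
  mutate(visibility = map(UnloadVars, ~ .$visibility)) %>%
  mutate(visibility = map(visibility, unpack_list)) %>%
  unnest(visibility)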

Community detection with bipartite graph in igraph

I have a bipartite list (posts, word categories) with 1000 vertices and want to use the fast and greedy algorithm for community detection, but I am not sure whether I have to run it on the bipartite graph or on the bipartite projection.
My bipartite list looks like this:
  post word
1   66    2
2  312    1
3  432    7
4  433    7
5  434    1
6  435    5
7  436    1
8  437    4
When I run it without a projection I have problems clustering in the second step:
library(igraph)
library(ggplot2)

### Load bipartite list and create graph ###
bipartite_list <- read.csv("bipartite_list_tnf.csv", header = TRUE, sep = ";")
bipartite_graph <- graph.incidence(bipartite_list)
g <- bipartite_graph

fc <- fastgreedy.community(g)  ## communities / clusters
set.seed(123)
l <- layout.fruchterman.reingold(g, niter=1000, coolexp=0.5)  ## layout
membership(fc)

# 2. checking who is in each cluster
cl <- data.frame(name = fc$post, cluster = fc$membership, stringsAsFactors=F)
cl <- cl[order(cl$cluster),]
cl[cl$cluster==1,]

# 3. preparing data for plot
d <- data.frame(l); names(d) <- c("x", "y")
d$cluster <- factor(fc$membership)

# 4. plot with only nodes, colored by cluster
p <- ggplot(d, aes(x=x, y=y, color=cluster))
pq <- p + geom_point()
pq
Maybe I have to run the community detection on a projection? But then I always get a failure, because a projection is not a graph object:
bipartite_graph <- graph.incidence(bipartite_list)

# projection (both directions)
projection_word_post <- bipartite.projection(bipartite_graph)
fc <- fastgreedy.community(projection_word_post)
Error in fastgreedy.community(projection_word_post) : Not a graph object
I would be glad for help!
When you run without the projection, the issue is at:
bipartite_graph <- graph.incidence(bipartite_list)
You need to reshape bipartite_list into an incidence table before passing it to the graph.incidence() function. Use the command below:
tab <- table(bipartite_list)
The rest of the steps are the same:
g <- graph.incidence(tab, mode=c("all"))
fc <- fastgreedy.community(g)
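On the projection attempt from the question: bipartite.projection() returns a list of two graphs (one per vertex type), not a single graph, which is why fastgreedy.community() complains. If you do want communities on a projection, cluster one element of that list, e.g.:

# proj$proj1 and proj$proj2 are the two one-mode projections
proj <- bipartite.projection(g)
fc_posts <- fastgreedy.community(proj$proj1)
membership(fc_posts)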