I was trying to rbind some JSON data scraped from an API:
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
  # Generate the URL for each page
  url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=', i)
  # Get the JSON data from each page and turn it into a data frame
  dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
  pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Setting the row.names to NULL doesn't work. I heard from someone that it is because some of the data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package, WDI, that accesses this data and works well, but I want to know how to resolve the duplicate row-names problem in general, so that I can deal with similar situations where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is that a single bracket can select multiple elements and, on a list, returns a list, whereas [[ selects exactly one element and returns the element itself.
You can see how this works with some fake data.
foo <- list(
a = rnorm(100),
b = rnorm(100),
c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select the single data.frame inside the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
  url <- paste0(
    'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
    i
  )
  res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
  final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)  # Use lapply to select the second element from each list element.
resl <- do.call(rbind, resl)   # This takes all the elements of the list and uses them as the arguments for rbind.
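As a side note (my addition, not something the original answer uses), data.table::rbindlist() performs the same combining step as do.call(rbind, ...) and is usually faster for long lists:
library(data.table)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)          # Same selection step as above.
final <- rbindlist(resl, fill = TRUE)  # fill = TRUE tolerates pages with missing columns.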
I'm having problems using the jsonlite package in R to collect Dota 2 match data via the Steam API. I am not an experienced developer and would really appreciate any help. Thanks!
I have created a script in R. When I check the API call using a web browser, it correctly returns the JSON contents, but when I execute the very same API call in R (either in a for loop or as a single call) using the fromJSON() function, I get the following errors:
Error in open.connection(con, "rb") : HTTP error 503.
In addition: Warning message:
closing unused connection 3 (https://api.steampowered.com/IDOTA2Match_570/GetMatchDetails/V001/?match_id=2170111273&key=XXXXXXXXXXPLACEHOLDERXXXXXXXXXXX)
This is the R script I have created to collect multiple JSON responses using the fromJSON command and jsonlite:
# Load required libraries
library(rvest)
library(stringr)
library(magrittr)
library(plyr)
library(dplyr)
library(tidyr)
library(knitr)
library(XML)
library(data.table)
library(foreign)
library(pbapply)
library(jsonlite)
## Set base url components
base.url_0 = "https://api.steampowered.com/IDOTA2Match_570/GetMatchDetails/V001/?match_id="
base.url_0.1 = "&key="
steamAPIkey = "XXXXXXXXXXPLACEHOLDERXXXXXXXXXXX" # Steam API Key
### Create a for loop where each "i" is a Dota 2 match ID
for (i in seq(1:length(targets$match_id))) {
  base.url = paste0(
    base.url_0,
    targets$match_id[i],
    base.url_0.1,
    steamAPIkey)
  message("Retrieving page ", targets$match_id[i])

  ## Get the JSON response and store it in a data frame
  ifelse(
    tmp_json <- fromJSON(
      txt = base.url, flatten = T),                    # if the JSON file exists
    as.data.frame(tmp_errors_1$matches) <- base.url    # if the JSON file does not exist
  ) # close ifelse statement

  tmp_json <- try_default(
    expr =
      as.data.frame(tmp_json),                         # convert the JSON file into a data frame
    default =
      as.data.frame(tmp_errors_2$matches) <- base.url, quiet = T) # if error, add the match ID to a data frame

  ## rbindlist
  l = list(results, tmp_json)
  results <- rbindlist(l, fill = T)

  ## Sleep for x seconds
  Sys.sleep(runif(1, 2, 3))
  ## End of loop
}
There are a bunch of files in a directory that have JSON-formatted entries on each line. The size of the files varies from 5k to 200MB. I have this code to go through each file, parse the data I am looking for in the JSON, and finally form a data frame. This script takes a very long time to finish; in fact, it never finishes.
Is there any way to speed it up so that I can read the files faster?
Code:
library(jsonlite)
library(data.table)
setwd("C:/Files/")
#data <- lapply(readLines("test.txt"), fromJSON)
df<-data.frame(Timestamp=factor(),Source=factor(),Host=factor(),Status=factor())
filenames <- list.files("Json_files", pattern="*.txt", full.names=TRUE)
for (i in filenames) {
  print(i)
  data <- lapply(readLines(i), fromJSON)
  myDf <- do.call("rbind", lapply(data, function(d) {
    data.frame(TimeStamp = d$payloadData$timestamp,
               Source = d$payloadData$source,
               Host = d$payloadData$host,
               Status = d$payloadData$status)}))
  df <- rbind(df, myDf)
}
This is a sample entry but there are thousands of entries like this in the file:
{"senderDateTimeStamp":"2016/04/08 10:53:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB01","servermember":"test"},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:54:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:55:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
With your example data in "c:/tmp.txt":
> df <- jsonlite::fromJSON(paste0("[",paste0(readLines("c:/tmp.txt"),collapse=","),"]"))$payloadData[c("timestamp","source","host","status")]
> df
timestamp source host status
1 2016-04-08T10:53:18.169 STREAM WEB01 get
2 2016-04-08T10:53:18.169 STREAM WEB02 get
3 2016-04-08T10:53:18.169 STREAM WEB02 get
So, to adapt your code to get a list of data frames:
dflist <- lapply(filenames, function(i) {
  jsonlite::fromJSON(
    paste0("[",
           paste0(readLines(i), collapse = ","),
           "]")
  )$payloadData[c("timestamp", "source", "host", "status")]
})
The idea is to transform your lines (from readLines) into one big JSON array and then create the data frame by parsing it as JSON.
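Here is a minimal sketch of that transformation, using two made-up one-line records rather than the original data:
lines <- c('{"x": 1, "y": "a"}', '{"x": 2, "y": "b"}')       # stand-ins for readLines(i)
one_array <- paste0("[", paste0(lines, collapse = ","), "]")  # "[{...},{...}]"
jsonlite::fromJSON(one_array)                                 # parses into a two-row data frame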
As lmo already showcased, using lapply on your filenames list provides you with a list of data frames. If you really want only one data frame at the end, you can load the data.table package and then use rbindlist on dflist to get a single data frame, as sketched below.
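A sketch of that combining step, assuming dflist was built with the lapply call above:
library(data.table)
# Stack the per-file data frames into one table; fill = TRUE guards against
# files whose payloadData is missing one of the four columns.
big_df <- rbindlist(dflist, fill = TRUE)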
Or, if you're short on memory, this thread may help you.
One speed-up is to replace your for loop with lapply and drop the final rbind. The gain here is that R no longer has to repeatedly copy an increasingly large data frame, df, over your "bunch" of files. The result is stored in a convenient list that you can either use as-is or convert to a data.frame in one go (a one-line combine is sketched after the code):
# Create a processing function
getData <- function(i) {
  print(i)
  data <- lapply(readLines(i), fromJSON)
  myDf <- do.call("rbind", lapply(data, function(d) {
    data.frame(TimeStamp = d$payloadData$timestamp,
               Source = d$payloadData$source,
               Host = d$payloadData$host,
               Status = d$payloadData$status)}))
}

# lapply over the files
myDataList <- lapply(filenames, getData)
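And if you do want a single data frame at the end, one combine after the lapply is enough (a sketch, not part of the original answer):
# Bind the per-file data frames once, instead of growing df inside a loop.
myData <- do.call(rbind, myDataList)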
I am trying to save about 300 HTML objects to disk using R.
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
xml2::write_xml(read_html_test1, "testwrite.html")
read_html <- xml2::read_html("testwrite.html")
But this will eventually save about 300 separate files to disk. Ideally, what I would like is to save a single R object to disk that contains these 300 documents.
Converting each document to text before saving does not work, for some reason. For example, the following will produce some weird (unhelpful) error:
str_html <- as.character(read_html_test1)
xml2::read_html(str_html)
If I try to use the output of xml2::read_html() directly, it is a pointer to a C structure, and therefore it will not persist to disk.
Any suggestions for a hack to make this work...?
I managed it with the httr package, whose content function can take an as = "text" argument, which stops it from parsing the HTML.
library(xml2)
library(httr)
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
# use `GET` to make the request, and pull out the html with `content`; returns text string
x <- content(GET(str_url), as = 'text')
# make a list of html documents to save
list_xs <- list(x, x)
# save list with `saveRDS`
saveRDS(list_xs, 'test.rds')
Now to see if it works:
# read in rds file we saved
saved_html <- readRDS('test.rds')
# parse the second element in it with `xml2::read_html`
saved_x_parsed <- read_html(saved_html[[2]])
# and let's see...
saved_x_parsed
# {xml_document}
# <html>
# [1] <head><title>\n\tNew Zealand holiday homes, baches and vacation homes for rent.\ ...
# [2] <body id="ctl00_Body" class="Page-List">\n    <div class="SatNavBarPlaceholder"/> ...
How to save R objects to disk:
Save R Objects
I took your example code and produced working, human-readable, R-loadable output as follows:
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
str_html <- as.character(read_html_test1)
x <- xml2::read_html(str_html)
save(x, file = "c:\\temp\\text.txt", compress = FALSE, ascii = TRUE)
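To read that file back into a later R session, the counterpart of save() is load(), which restores the saved object under its original name (a usage note, not part of the original answer):
load("c:\\temp\\text.txt")  # restores the object `x` into the current workspace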
I was trying to use readHTMLTable to store some data in a data frame in RStudio, but it keeps telling me could not find function "readHTMLTable". I don't understand what I did wrong. Can someone take a look at this and tell me how I can fix it, or whether it works in your RStudio?
url <- 'http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/case-counts.html'
ebola <- getURL(url)
ebola <- readHTMLTable(ebola, stringAsFactors = F)
Error: could not find function "readHTMLTable"
You are reading the table in with R's default, which converts characters to factors. You can use stringsAsFactors = FALSE in readHTMLTable, and this will be passed on to data.frame. Also, the table uses commas as thousands separators, which you will need to remove:
library(XML)
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1 <- readHTMLTable(url1, which = 2, stringsAsFactors = FALSE)
df1$"Human death"
mySum <- sum(as.integer(gsub(",", "", df1$"Human death")))
> mySum
[1] 6910
The problem is that you didn't load the XML library:
library(XML)
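For completeness, here is a sketch of the original snippet with the libraries attached (this assumes RCurl is installed, since getURL() comes from RCurl, not XML):
library(RCurl)  # provides getURL()
library(XML)    # provides readHTMLTable()
url <- 'http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/case-counts.html'
ebola <- getURL(url)
ebola <- readHTMLTable(ebola, stringsAsFactors = FALSE)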