How to handle HTTP error 503 when making API calls to process JSON files in R with the jsonlite package? - json

I'm having problems using the JSONlite package in R to collect Dota2 match data using the Steam API. I am not an experienced developer and really appreciate any help. Thanks!
I have created a script in R. When I check the API call using a web browser it correctly returns the JSON contents, but when I execute the very same API call in R (either in a for loop or as a single call) using the fromJSON() function, I get the following errors:
Error in open.connection(con, "rb") : HTTP error 503.
In addition: Warning message:
closing unused connection 3 (https://api.steampowered.com/IDOTA2Match_570/GetMatchDetails/V001/?match_id=2170111273&key=XXXXXXXXXXPLACEHOLDERXXXXXXXXXXX)
This is the R script I have created to collect multiple JSON responses using the fromJSON command and jsonlite:
# Load required libraries
library(rvest)
library(stringr)
library(magrittr)
library(plyr)
library(dplyr)
library(tidyr)
library(knitr)
library(XML)
library(data.table)
library(foreign)
library(pbapply)
library(jsonlite)
## Set base url components
base.url_0 = "https://api.steampowered.com/IDOTA2Match_570/GetMatchDetails/V001/?match_id="
base.url_0.1 = "&key="
steamAPIkey = "XXXXXXXXXXPLACEHOLDERXXXXXXXXXXX" # Steam API Key
### Create for loop where each "i" is a DOTA2 match ID
for(i in seq(1:length(targets$match_id))) {
base.url = paste0(
base.url_0,
targets$match_id[i],
base.url_0.1,
steamAPIkey)
message("Retrieving page ", targets$match_id[i])
## Get JSON response and store into data.frame
ifelse(
tmp_json <- fromJSON(
txt = base.url,flatten = T), # if the json file exists
as.data.frame(tmp_errors_1$matches) <- base.url # if the json file does not exists
) # close ifelse statement
tmp_json <- try_default(
expr =
as.data.frame(tmp_json), # convert json file into a data frame
default =
as.data.frame(tmp_errors_2$matches) <- base.url, quiet = T) # if error, add match id to a dataframe
## Rbindlist
l = list(results, tmp_json)
results <- rbindlist(l,fill = T)
## Sleep for x seconds
Sys.sleep(runif(1, 2, 3))
## End of loop
}

Related

Error: 'attrition' is not an exported object from 'namespace:rsample'

I am using a book Hands on Machine Learning By Bradley Boehmke. i wrote following line to access the data but i gave this error.
attrition <- rsample::attrition
Error: 'attrition' is not an exported object from 'namespace:rsample'
I used to have the same question when I went across it, but then I found out that now the attrition DataFrame belongs to modeldata package. simply use the following code:
# Load the required libraries first
my_libraries <- c("rsample","modeldata","caret", "h2o", "dplyr",
"ggplot2")
lapply(my_libraries, require, character.only = T)
# update 24.12.2021: it is best to call the dataset like:
data(attrition)
# **PROBLEM STATEMENT** here: there is a list of ordered factor variables,
# which is not_possible for h2o to deal with.
# for Job attrition data
churn <- attrition %>% mutate_if(is.ordered, .funs = factor,ordered = F)
# NOTE: funs() can create a list of function calls.
churn.h2o <- as.h2o(churn)
churn.h2o %>% View()
library(modeldata)
data("attrition", package = "modeldata")
churn <- attrition %>% mutate_if(is.ordered, .funs = factor,ordered = F)

is it possible to process file reading and parsing in R

There are bunch of files in a directory that has json formatted entries in each line. The size of the files varies from 5k to 200MB. I have this code to go though each file, parse the data I am looking for in the json and finally form a data frame. This script is taking a very long time to finish, in fact it never finishes.
Is there any way to speed it up so that I can read the files faster?
Code:
library(jsonlite)
library(data.table)
setwd("C:/Files/")
#data <- lapply(readLines("test.txt"), fromJSON)
df<-data.frame(Timestamp=factor(),Source=factor(),Host=factor(),Status=factor())
filenames <- list.files("Json_files", pattern="*.txt", full.names=TRUE)
for(i in filenames){
print(i)
data <- lapply(readLines(i), fromJSON)
myDf <- do.call("rbind", lapply(data, function(d) {
data.frame(TimeStamp = d$payloadData$timestamp,
Source = d$payloadData$source,
Host = d$payloadData$host,
Status = d$payloadData$status)}))
df<-rbind(df,myDf)
}
This is a sample entry but there are thousands of entries like this in the file:
{"senderDateTimeStamp":"2016/04/08 10:53:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB01","servermember":"test"},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:54:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:55:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
With your example data in "c:/tmp.txt":
> df <- jsonlite::fromJSON(paste0("[",paste0(readLines("c:/tmp.txt"),collapse=","),"]"))$payloadData[c("timestamp","source","host","status")]
> df
timestamp source host status
1 2016-04-08T10:53:18.169 STREAM WEB01 get
2 2016-04-08T10:53:18.169 STREAM WEB02 get
3 2016-04-08T10:53:18.169 STREAM WEB02 get
So to adapt your code to get a list of dataframes:
dflist <- lapply(filenames, function(i) {
jsonlite::fromJSON(
paste0("[",
paste0(readLines(i),collapse=","),
"]")
)$payloadData[c("timestamp","source","host","status")]
})
The idea is to transform your lines (from readLines) into a big json array and then create the dataframe by parsing it as json.
As lmo already showcased, using lapply on your filenmaes list procide you with a list of dataframes, if you really want only one dataframe at end you can load the data.table packages and then use rbindlist on dflist to get only one dataframe.
Or if you're short in memory this thread may help you.
One speed up is to replace your for loop with lapply Then drop the final rbind. the speed up here would be that R would not have to repeatedly copy an increasingly large file, df over your "bunch" of files. The result would be stored in a convenient list that you could either use as is or convert to a data.frame in one go:
# create processing function
getData <- function(i) {
print(i)
data <- lapply(readLines(i), fromJSON)
myDf <- do.call("rbind", lapply(data, function(d) {
data.frame(TimeStamp = d$payloadData$timestamp,
Source = d$payloadData$source,
Host = d$payloadData$host,
Status = d$payloadData$status)}))
}
# lapply over files
myDataList <- lapply(filenames, getData)

R: How to cache scraped websites (XML package) for later processing

I have the following function to webscrape websites:
library(XML)
dl_url <- function(link_url) {
con <- url(link_url)
raw_data <- readLines(con)
close(con)
parsed_data <- htmlTreeParse(raw_data, useInternalNodes = TRUE)
parsed_data
}
When I use:
URLs <- lapply(list_urls, dl_url)
I get the expected list of parsed websites,
str(URLs):
List of x
$ :Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
$ :Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
....
However, I am unable to store the data. dput(URLs) only yields a 1 kb file with text in it.
What is the best way to locally cache (parsed) html websites in R?
Thank you very much!

R: Extract JSON Variable Info

I'm trying to download NBA player information from Numberfire and then put that information into a data frame. However I seem to be running into a few issues
The following snippet downloads the information just fine
require(RCurl)
require(stringr)
require(rjson)
#download data from numberfire
nf <- "https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections"
html <- getURL(nf)
Then there is what I assume to be a JSON data structure
#extract json variable (?)
pat <- "NF_DATA.*}}}"
jsn <- str_extract(html, pat)
jsn <- str_split(jsn, "NF_DATA = ")
parse <- newJSONParser()
parse$addData(jsn)
It seems to add data OK as it doesn't throw any errors, but if there is data in that object I can't tell or seem to get it out!
I'd paste in the jsn variable but it's way over the character limit. Any hints as to where I'm going wrong would be much appreciated
Adding the final line gets a nice list format that you can transform to a data.frame
require(RCurl); require(stringr); require(rjson)
#download data from numberfire
nf <- "https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections"
html <- getURL(nf)
#extract json variable (?)
pat <- "NF_DATA.*}}}"
jsn <- str_extract(html, pat)
jsn <- str_split(jsn, "NF_DATA = ")
fromJSON(jsn[[1]][[2]])

Importing data from a JSON file into R [duplicate]

This question already has answers here:
Parse JSON with R
(6 answers)
Closed 2 years ago.
Is there a way to import data from a JSON file into R? More specifically, the file is an array of JSON objects with string fields, objects, and arrays. The RJSON Package isn't very clear on how to deal with this http://cran.r-project.org/web/packages/rjson/rjson.pdf.
First install the rjson package:
install.packages("rjson")
Then:
library("rjson")
json_file <- "http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
Update: since version 0.2.1
json_data <- fromJSON(file=json_file)
jsonlite will import the JSON into a data frame. It can optionally flatten nested objects. Nested arrays will be data frames.
> library(jsonlite)
> winners <- fromJSON("winners.json", flatten=TRUE)
> colnames(winners)
[1] "winner" "votes" "startPrice" "lastVote.timestamp" "lastVote.user.name" "lastVote.user.user_id"
> winners[,c("winner","startPrice","lastVote.user.name")]
winner startPrice lastVote.user.name
1 68694999 0 Lamur
> winners[,c("votes")]
[[1]]
ts user.name user.user_id
1 Thu Mar 25 03:13:01 UTC 2010 Lamur 68694999
2 Thu Mar 25 03:13:08 UTC 2010 Lamur 68694999
An alternative package is RJSONIO. To convert a nested list, lapply can help:
l <- fromJSON('[{"winner":"68694999", "votes":[
{"ts":"Thu Mar 25 03:13:01 UTC 2010", "user":{"name":"Lamur","user_id":"68694999"}},
{"ts":"Thu Mar 25 03:13:08 UTC 2010", "user":{"name":"Lamur","user_id":"68694999"}}],
"lastVote":{"timestamp":1269486788526,"user":
{"name":"Lamur","user_id":"68694999"}},"startPrice":0}]'
)
m <- lapply(
l[[1]]$votes,
function(x) c(x$user['name'], x$user['user_id'], x['ts'])
)
m <- do.call(rbind, m)
gives information on the votes in your example.
If the URL is https, like used for Amazon S3, then use getURL
json <- fromJSON(getURL('https://s3.amazonaws.com/bucket/my.json'))
First install the RJSONIO and RCurl package:
install.packages("RJSONIO")
install.packages("(RCurl")
Try below code using RJSONIO in console
library(RJSONIO)
library(RCurl)
json_file = getURL("https://raw.githubusercontent.com/isrini/SI_IS607/master/books.json")
json_file2 = RJSONIO::fromJSON(json_file)
head(json_file2)
load the packages:
library(httr)
library(jsonlite)
I have had issues converting json to dataframe/csv. For my case I did:
Token <- "245432532532"
source <- "http://......."
header_type <- "applcation/json"
full_token <- paste0("Bearer ", Token)
response <- GET(n_source, add_headers(Authorization = full_token, Accept = h_type), timeout(120), verbose())
text_json <- content(response, type = 'text', encoding = "UTF-8")
jfile <- fromJSON(text_json)
df <- as.data.frame(jfile)
then from df to csv.
In this format it should be easy to convert it to multiple .csvs if needed.
The important part is content function should have type = 'text'.
import httr package
library(httr)
Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)
Print content of resp as text
content(resp, as = "text")
Print content of resp
content(resp)
Use content() to get the content of resp, but this time do not specify
a second argument. R figures out automatically that you're dealing
with a JSON, and converts the JSON to a named R list.