Converting JSON file to data.frame - json

I'm having a heck of a time trying to convert a JSON file to a data frame. I have searched and tried to adapt others' code to my example, but none of it seems to fit. The output is always a list instead of a data frame.
library(jsonlite)
library(RCurl)  # provides getURL()
URL <- getURL("http://scores.nbcsports.msnbc.com/ticker/data/gamesMSNBC.js.asp?xml=true&sport=NBA&period=20160104")
URLP <- fromJSON(URL, simplifyDataFrame = TRUE, flatten = FALSE)
URLP
Here is the format the result always ends up in:
$games
[1] "<ticker-entry gamecode=\"2016010405\" gametype=\"Regular Season\"><visiting-team display_name=\"Toronto\" alias=\"Tor\" nickname=\"Raptors\" id=\"28\" division=\"ECA\" conference=\"EC\" score=\"\"><score heading=\"\" value=\"0\" team-fouls=\"0\"></score><team-record wins=\"21\" losses=\"14\"></team-record><team-logo link=\"http://hosted.stats.com/nba/logos/nba_50x33/Toronto_Raptors.png\" gz-image=\"http://hosted.stats.com/GZ/images/NBAlogos/TorontoRaptors.png\"></team-logo></visiting-team><home-team display_name=\"Cleveland\" alias=\"Cle\" nickname=\"Cavaliers\" id=\"5\" division=\"ECC\" conference=\"EC\" score=\"\"><score heading=\"\" value=\"0\" team-fouls=\"0\"></score><team-record wins=\"22\" losses=\"9\" ties=\"\"></team-record><team-logo link=\"http://hosted.stats.com/nba/logos/nba_50x33/Cleveland_Cavaliers.png\" gz-image=\"http://hosted.stats.com/GZ/images/NBAlogos/ClevelandCavaliers.png\"></team-logo></home-team><gamestate status=\"Pre-Game\" display_status1=\"7:00 PM\" display_status2=\"\" href=\"http://scores.nbcsports.msnbc.com/nba/preview.asp?g=2016010405\" tv=\"FSOH/SNT\" gametime=\"7:00 PM\" gamedate=\"1/4\" is-dst=\"0\" is-world-dst=\"0\"></gamestate></ticker-entry>"

With regard to @jbaums' comment, you could try
library(jsonlite)
library(RCurl)
library(dplyr)
library(XML)
URL <- getURL("http://scores.nbcsports.msnbc.com/ticker/data/gamesMSNBC.js.asp?xml=true&sport=NBA&period=20160104")
lst <- lapply(fromJSON(URL)$games, function(x) as.data.frame(t(unlist(xmlToList(xmlParse(x)))), stringsAsFactors=FALSE))
df <- bind_rows(lst)
View(df)
... in theory. However, as @hrbrmstr pointed out, in practice this would violate the website owner's terms of service.
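The underlying issue is that the JSON only wraps a vector of XML strings, so each string has to be parsed as XML before it can become a row. Below is a minimal offline sketch of that idea. It is not the answer's exact method: instead of flattening every attribute, it picks a few named attributes per game. The two ticker entries are hypothetical, heavily shortened stand-ins for the real feed, and `parse_game` is a helper name made up for this illustration.

```r
library(jsonlite)
library(XML)

# Hypothetical, shortened ticker entries (not real feed data).
xml_games <- c(
  '<ticker-entry gamecode="g1"><visiting-team alias="Tor"/><home-team alias="Cle"/></ticker-entry>',
  '<ticker-entry gamecode="g2"><visiting-team alias="Mia"/><home-team alias="Bos"/></ticker-entry>'
)

# Wrap the XML strings in JSON to mimic the shape of the response.
json_txt <- as.character(toJSON(list(games = xml_games)))

# fromJSON() returns a character vector of XML strings, not a data frame.
games <- fromJSON(json_txt)$games

# Parse one XML string and pull out a few attributes as a one-row data frame.
parse_game <- function(x) {
  root <- xmlRoot(xmlParse(x, asText = TRUE))
  data.frame(
    gamecode = xmlGetAttr(root, "gamecode"),
    visitor  = xmlGetAttr(root[["visiting-team"]], "alias"),
    home     = xmlGetAttr(root[["home-team"]], "alias"),
    stringsAsFactors = FALSE
  )
}

# One row per game.
games_df <- do.call(rbind, lapply(games, parse_game))
print(games_df)
```

Naming the attributes explicitly keeps the column set stable even if some entries carry extra or missing attributes, at the cost of having to list the fields you care about.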

Related

Web scraping with R, solution with Jsonlite seems flaky

I maintain small scripts to extract financial data from websites. One of them retrieves the Dutch natural gas grid balance. However, I keep having problems with it: it works for a while, then I get an error message, and eventually I find a workaround. It seems that I am using a rather flaky method. Could anyone point me to a better approach (or package) for getting this done?
Below is the code (which has stopped working again):
library(curl)
library(bitops)
url <- "https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh"
h <- new_handle(copypostfields ="moo=moomooo")
handle_setheaders(h, "Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow")
req <- curl_fetch_memory(url, handle=h)
x <- rawToChar(req$content)
library(jsonlite)
json_data <- fromJSON(x)
data <- json_data[,c(1,4)]
n=tail(data,1)
Many thanks
You can use rvest for this (but there could be better approaches too)
library(rvest)
json_data <- read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.)
data <- json_data[,c(1,4)]
n=tail(data,1)
n
Output:
> n
           sbsdatetime position
37 2017-11-16 12:00:00       -9
A slightly more elegant solution if the intermediate data frame isn't required:
library(rvest)
library(dplyr)
read_html('https://www.gasunietransportservices.nl/en/shippers/balancing-regime/sbs-and-pos/graphactualjson/MWh') %>%
html_text() %>%
jsonlite::fromJSON(.) %>%
select(1, 4) %>%  # columns 1 and 4, matching json_data[,c(1,4)] above
tail(n=1)
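Part of the flakiness may come from selecting columns by position, which breaks silently if the endpoint adds or reorders fields. A minimal sketch of a more defensive variant follows, run against an inline JSON string so that it works offline. The sample records are made up for illustration; only the column names `sbsdatetime` and `position` come from the output shown above.

```r
library(jsonlite)

# Hypothetical sample payload; the real endpoint returns more fields.
json_txt <- '[
  {"sbsdatetime": "2017-11-16 11:00:00", "position": -4},
  {"sbsdatetime": "2017-11-16 12:00:00", "position": -9}
]'

json_data <- fromJSON(json_txt)

# Fail early with a clear error if the response shape changed.
wanted <- c("sbsdatetime", "position")
stopifnot(is.data.frame(json_data), all(wanted %in% names(json_data)))

# Select by name rather than by position, then keep the latest record.
latest <- tail(json_data[, wanted], 1)
print(latest)
```

With the live URL, `fromJSON()` can also be given the address directly, provided the server answers a plain GET request with JSON.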

How to read a <li> table in a webpage

I have debugged the program many times, trying to get a result like the following:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no matter what parameters I use, the result is not correct. The content of this table is part of the basis for my further analysis, and I am very troubled by it. I would sincerely appreciate your help.
## download community-list --- the 1st level of IR Grid
library(XML)
# load and parse the webpage
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb<-readHTMLTable(tableNodes[[8]], header=TRUE)
# get External links
xpath <- "//a/@href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)
It is unclear exactly what you want your end result to look like, but if you modify your XPath statements a bit to take advantage of the DOM structure, you can get something like this:
library(XML)
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[@class="communityLink"]//@href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[@class="communityLink"]//a', xmlValue)
table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data
Console output causes Stack Overflow to think this answer is spam, but here is a screenshot:
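The same XPath idea can be tried offline on a small inline page. This is a minimal sketch: the HTML below is a hypothetical stand-in for the community list, with placeholder institute names, and only the `communityLink` class and the `/handle/...` paths are taken from the thread.

```r
library(XML)

# Hypothetical stand-in for the community-list page.
html_txt <- '<html><body><table><tr><td>
  <h1>Communities</h1>
  <ul>
    <li class="communityLink"><a href="/handle/1471x/1">Institute A</a></li>
    <li class="communityLink"><a href="/handle/1471x/8865">Institute B</a></li>
  </ul>
</td></tr></table></body></html>'

doc <- htmlParse(html_txt, asText = TRUE)

# Select the <a> nodes once, then read the href and the text from each node,
# so the two columns cannot drift out of alignment.
links <- getNodeSet(doc, '//li[@class="communityLink"]//a')

link_table <- data.frame(
  url  = vapply(links, xmlGetAttr, character(1), "href"),
  name = vapply(links, xmlValue, character(1)),
  stringsAsFactors = FALSE
)
print(link_table)
```

Reading both values from the same node set is a design choice made here, not something the answer does; it avoids mismatched rows if an `<li>` ever contains more than one `href`.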

R: Extract JSON Variable Info

I'm trying to download NBA player information from Numberfire and then put that information into a data frame. However, I seem to be running into a few issues.
The following snippet downloads the information just fine:
require(RCurl)
require(stringr)
require(rjson)
#download data from numberfire
nf <- "https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections"
html <- getURL(nf)
Then there is what I assume to be a JSON data structure
#extract json variable (?)
pat <- "NF_DATA.*}}}"
jsn <- str_extract(html, pat)
jsn <- str_split(jsn, "NF_DATA = ")
parse <- newJSONParser()
parse$addData(jsn)
It seems to add data OK as it doesn't throw any errors, but if there is data in that object I can't tell or seem to get it out!
I'd paste in the jsn variable but it's way over the character limit. Any hints as to where I'm going wrong would be much appreciated
Adding the final line gets a nice list format that you can transform to a data.frame
require(RCurl); require(stringr); require(rjson)
#download data from numberfire
nf <- "https://www.numberfire.com/nba/fantasy/fantasy-basketball-projections"
html <- getURL(nf)
#extract json variable (?)
pat <- "NF_DATA.*}}}"
jsn <- str_extract(html, pat)
jsn <- str_split(jsn, "NF_DATA = ")
fromJSON(jsn[[1]][[2]])
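To show how that list might then become a data frame, here is a minimal offline sketch. It swaps in base R regular expressions and `jsonlite` in place of `stringr` and `rjson`, and it runs on a made-up one-line page. The structure of `NF_DATA` (a `players` object keyed by id, with `name` and `pts` fields) is an assumption for illustration; the real page may be shaped differently.

```r
library(jsonlite)

# Hypothetical page fragment; the real NF_DATA object is far larger.
html <- paste0(
  '<script>var NF_DATA = {"players":{',
  '"p1":{"name":"Player One","pts":21.5},',
  '"p2":{"name":"Player Two","pts":17}',
  '}};</script>'
)

# Grab everything from "NF_DATA = {" up to the last closing brace.
m <- regmatches(html, regexpr("NF_DATA = \\{.*\\}", html))
json_txt <- sub("^NF_DATA = ", "", m)

nf <- fromJSON(json_txt)

# Each player is a named list of scalars, so it maps to one data frame row.
players_df <- do.call(
  rbind,
  lapply(nf$players, as.data.frame, stringsAsFactors = FALSE)
)
print(players_df)
```

The greedy match relies on no `}` appearing after the JSON object on the same line, which holds for this sample but should be checked against the real page.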

JSON to R for Data Mining

I am trying to grab tweets using the Topsy Otter API, so I can perform some data mining on them for my dissertation.
So far, I have got:
library(RJSONIO)
library(RCurl)
tweet_data <- getURL("http://otter.topsy.com/search.json?q=PSN&mintime=1301634000&perpage=10&maxtime=1304226000&apikey=xxx")
fromJSON(tweet_data)
This works fine. However, I now want to return just a couple of details from this response, 'content' and 'trackback_date'. I cannot figure out how: I have tried cobbling a couple of examples together, but I have been unable to extract what I want.
Here is what I've tried so far:
trackback_date <- lapply(tweet_data$result, function(x){x$trackback_date})
content <- lapply(tweet_data$result, function(x){x$content})
Any help would be greatly appreciated, thank you.
edit
I have also tried:
library("rjson")
# use rjson
tweet_data <- fromJSON(paste(readLines("http://otter.topsy.com/search.json?q=PSN&mintime=1301634000&perpage=10&maxtime=1304226000&apikey=xxx"), collapse=""))
# get a data from Topsy Otter API
# convert JSON data into R object using fromJSON()
trackback_date <- lapply(tweet_data$result, function(x){x$trackback_date})
content <- lapply(tweet_data$result, function(x){x$content})
Basic processing of Topsy Otter API response:
library(RJSONIO)
library(RCurl)
tweet_data <- getURL("http://otter.topsy.com/search.json?q=PSN&mintime=1301634000&perpage=10&maxtime=1304226000&apikey=xxx")
#
# Addition to your code
#
tweets <- fromJSON(tweet_data)$response$list
content <- sapply(tweets, function(x) x$content)
trackback_date <- sapply(tweets, function(x) x$trackback_date)
EDIT: Processing multiple pages
Function gets 100 items from specified page:
pagetweets <- function(page){
  url <- paste("http://otter.topsy.com/search.json?q=PSN&mintime=1301634000&page=", page,
               "&perpage=100&maxtime=1304226000&apikey=xxx",
               collapse="", sep="")
  tweet_data <- getURL(url)
  fromJSON(tweet_data)$response$list
}
Now we can apply it to multiple pages:
tweets <- unlist(lapply(1:10, pagetweets), recursive=FALSE)
And, voila, this code:
content <- sapply(tweets, function(x) x$content)
trackback_date <- sapply(tweets, function(x) x$trackback_date)
returns 1000 records.
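For data mining it is usually more convenient to hold the two fields in one data frame. A minimal base R sketch is below, using a small hand-made list in place of the API response so that it runs offline. The sample records are invented, and treating `trackback_date` as a Unix timestamp is an assumption based on the epoch-style `mintime` and `maxtime` parameters.

```r
# Hypothetical stand-in for the parsed list of tweets.
tweets <- list(
  list(content = "first tweet",  trackback_date = 1301634100),
  list(content = "second tweet", trackback_date = 1301634200)
)

# vapply() enforces one value of the stated type per tweet, so a record
# with a missing field raises an error instead of silently returning a list.
tweets_df <- data.frame(
  content = vapply(tweets, function(x) x$content, character(1)),
  trackback_date = as.POSIXct(
    vapply(tweets, function(x) x$trackback_date, numeric(1)),
    origin = "1970-01-01", tz = "UTC"  # assumes Unix epoch seconds
  ),
  stringsAsFactors = FALSE
)
print(tweets_df)
```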

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do it?
I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to use regex on HTML, so you will definitely want to parse this with the XML package.
Here's an example to get you started:
require(RCurl)
require(XML)
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
This results in a character vector of mostly just webpage text (along with some javascript):
> head(x)
[1] "Subscribe to Print Edition" "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"
[4] "  Make Haaretz your homepage" "/*check the search form*/" "function chkSearch()"
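The stray JavaScript in that output could probably be avoided by excluding script and style nodes in the XPath query itself, instead of cleaning up afterwards. A minimal sketch on a hypothetical inline page, so that it runs without network access:

```r
library(XML)

# Hypothetical page fragment with inline JavaScript mixed into the body.
html_txt <- paste0(
  '<html><body>',
  '<script>function chkSearch() { return true; }</script>',
  '<table><tr><td>Subscribe to Print Edition</td></tr></table>',
  '<p>  Israel Time: 16:48  </p>',
  '</body></html>'
)

doc <- htmlParse(html_txt, asText = TRUE)

# Take every text node under <body> except those inside <script> or <style>.
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim whitespace and drop empty strings.
txt <- trimws(txt)
txt <- txt[nzchar(txt)]
print(txt)
```

This queries all visible text rather than only tables, which is a different selection from the one in the answer above.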
Your best bet may be the XML package -- see for example this previous question.
I know you asked for R, but maybe Python with Beautiful Soup is the way forward here? You could then do your analysis in R once you have scraped the page with Beautiful Soup.