I want to use R to crawl news results from this URL: http://www.foxnews.com/search-results/search?q="AlphaGo"&ss=fn&start=0. Here is my code:
url <- "http://api.foxnews.com/v1/content/search?q=%22AlphaGo%22&fields=date,description,title,url,image,type,taxonomy§ion.path=fnc&start=0&callback=angular.callbacks._0&cb=2017719162"
html <- str_c(readLines(url,encoding = "UTF-8"),collapse = "")
content_fox <- RJSONIO:: fromJSON(html)
However, the JSON could not be parsed, and this error showed up:
Error in file(con, "r") : cannot open the connection
I notice that the response starts with angular.callbacks._0, which I think might be the problem.
Any idea how to fix this?
Following the answer in Parse JSONP with R, I adjusted my code by adding two new lines, and it worked:
url <- "http://api.foxnews.com/v1/content/search?q=%22AlphaGo%22&fields=date,description,title,url,image,type,taxonomy§ion.path=fnc&start=0&callback=angular.callbacks._0&cb=2017719162"
html <- str_c(readLines(url,encoding = "UTF-8"),collapse = "")
html <- sub('[^\\{]*', '', html) # remove function name and opening parenthesis
html <- sub('\\)$', '', html) # remove closing parenthesis
content_fox <- RJSONIO:: fromJSON(html)
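For what it's worth, a minimal alternative sketch (assuming the response body is always wrapped exactly as angular.callbacks._0(...)) strips the JSONP wrapper with a single regular expression instead of two sub() calls:

library(stringr)

html <- str_c(readLines(url, encoding = "UTF-8"), collapse = "")
# Capture everything between the first "(" and the last ")" of the JSONP wrapper
json <- sub("^[^(]*\\((.*)\\)[^)]*$", "\\1", html)
content_fox <- RJSONIO::fromJSON(json)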
I am currently trying to scrape the data behind the two graphs (Forsmark and Ringhals) on the following HTML page: https://group.vattenfall.com/se/var-verksamhet/vara-energislag/karnkraft/aktuell-karnkraftsproduktion
The data originate from script tags like this one (fragment):
<script type="text/javascript">
/*<![CDATA[*/ productionData = JSON.parse("{\"timestamp\":1582642616000,\"powerPlant\":\"Ringhals\", // etc
</script>
I would like to get two data frames that look like these:
F1 F2 F3
number number number
and
R1 R2 R3
number number number
I tried to use the XML package and XPath to parse the HTML page but did not get anywhere with that.
Do you have any ideas?
Thanks!
Those charts are <iframe>s that load from
https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark and
https://gvp.vattenfall.com/sweden/produced-power/iframe/ringhals
so you should scrape those two pages directly.
This was an interesting challenge.
It becomes manageable with rvest and jsonlite, which you will need to install if you don't already have them (if you install them from source on Windows, you may need Rtools).
Try this:
library('rvest')
library('jsonlite')
# Load the URL (do the same for the other iframe)
url <- 'https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark'
# Parse it
webpage <- read_html(url)
# Extract the script element. That's a CSS selector for the specific one that holds the json data
# You can find it in your browser's DevTools by finding the script element
# and right-clicking, choosing Copy > CSS Path/Selector
script_element <- html_nodes(webpage, 'body > section:nth-child(2) > script:nth-child(2)')
# Extract its string content
json <- html_text(script_element)
# Clean it up
json <- gsub("\n /*<![CDATA[*/\n productionData = JSON.parse(", "", json, fixed = TRUE)
json <- gsub(");\n /*]]>*/\n ", "", json, fixed = TRUE)
json <- gsub("\"{", "{\"", json, fixed = TRUE)
json <- gsub("}\"", "}", json, fixed = TRUE)
json <- gsub("{\"\\\"", "{\\\"", json, fixed = TRUE)
# Extract data
data <- jsonlite::fromJSON(gsub("\\\"", "\"", json, fixed = TRUE))
Caveat: I'm not really an R expert; there is likely a more elegant way of doing this (particularly the data-cleaning portion), but it works.
For historical preservation, this takes the following DOM node (the text content of the <script> tag):
"\n /*<![CDATA[*/\n productionData = JSON.parse(\"{\\\"timestamp\\\":1582643336000,\\\"powerPlant\\\":\\\"Forsmark\\\",\\\"blockProductionDataList\\\":[{\\\"name\\\":\\\"F1\\\",\\\"production\\\":998.86194,\\\"percent\\\":99.88619},{\\\"name\\\":\\\"F2\\\",\\\"production\\\":1120.434,\\\"percent\\\":97.8545},{\\\"name\\\":\\\"F3\\\",\\\"production\\\":1189.7126,\\\"percent\\\":99.55754}]}\");\n /*]]>*/\n "
and will result in data of this format
> data
$timestamp
[1] 1.582647e+12
$powerPlant
[1] "Forsmark"
$blockProductionDataList
name production percent
1 F1 997.7902 99.77902
2 F2 1131.6150 98.83100
3 F3 1190.0520 99.58594
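To end up with the two one-row data frames (F1/F2/F3 and R1/R2/R3) asked for, a hedged sketch that reuses the idea above for both iframes could look like this; the regex extraction of the JSON.parse("...") argument and the final reshaping are assumptions, not guaranteed against every revision of the page:

library(rvest)
library(jsonlite)

get_production <- function(url) {
  # Assumes a single <script> on the page mentions productionData
  txt <- read_html(url) %>%
    html_nodes('script') %>%
    html_text() %>%
    grep(pattern = "productionData", value = TRUE)
  # Keep only the string literal inside JSON.parse("...") and unescape the quotes
  raw <- sub('.*JSON\\.parse\\("', '', txt)
  raw <- sub('"\\).*', '', raw)
  fromJSON(gsub('\\\\"', '"', raw))
}

forsmark <- get_production('https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark')
ringhals <- get_production('https://gvp.vattenfall.com/sweden/produced-power/iframe/ringhals')

# One row per plant, one column per block (e.g. F1 F2 F3)
forsmark_df <- setNames(as.data.frame(t(forsmark$blockProductionDataList$production)),
                        forsmark$blockProductionDataList$name)
ringhals_df <- setNames(as.data.frame(t(ringhals$blockProductionDataList$production)),
                        ringhals$blockProductionDataList$name)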
I have multiple JSON files containing Tweets from Twitter. I want to import and edit them in R one by one.
For a single file my code looks like this:
data <- fromJSON("filename.json")
data <- data[c(1:3,13,14)]
data$lang <- ifelse(data$lang!="de",NA,data$lang)
data <- na.omit(data)
write_as_csv(data,"filename.csv")
Now I want to apply this code to multiple files. I found example code for a for loop here:
Loop in R to read many files
Applied to my problem it should look something like this:
setwd("~/Documents/Elections")
ldf <- list()
listjson <- dir(pattern = "*.json")
for (k in 1:length(listjson)){
data[k] <- fromJSON(listjson[k])
data[k] <- data[k][c(1:3,13,14)]
data[k]$lang <- ifelse(data[k]$lang!="de",NA,data[k]$lang)
data[k] <- na.omit(data[k])
filename <- paste(k, ".csv")
write_as_csv(listjson[k],filename)
}
But already the first line inside the loop doesn't work:
> data[k] <- fromJSON(listjson[k])
Warning message:
In `[<-.data.frame`(`*tmp*`, k, value = list(createdAt = c(1505935036000, :
provided 35 variables to replace 1 variables
I can't figure out why. Also, I wonder if there is a nicer way to solve this without using a for loop. I read about the apply family, I just don't know how to apply it to my problem. Thanks in advance!
This is an example how my data looks:
https://drive.google.com/file/d/19cRS6p_mHbO6XXprfvc6NPZWuf_zG7jr/view?usp=sharing
It should work like this:
setwd("~/Documents/Elections")
listjson <- dir(pattern = "*.json")
for (k in 1:length(listjson)){
# Load the JSON file that corresponds to the k-th element in your list of files
data <- fromJSON(listjson[k])
# Select relevant columns from the dataframe
data <- data[,c(1:3,13,14)]
# Manipulate data
data$lang <- ifelse(data$lang!="de",NA,data$lang)
data <- na.omit(data)
filename <- paste0(listjson[k], ".csv")  # paste0 avoids the stray space paste() would insert
write_as_csv(data,filename)
}
For the second part of the question: apply applies a function over the rows or columns of a data frame. That is not your case here, as you are looping over a character vector of file names that are used somewhere else.
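If the goal is mainly to avoid writing the loop by hand, lapply can still map the same per-file steps over the vector of file names. A hedged sketch, assuming (as the question seems to) that fromJSON comes from jsonlite and write_as_csv from rtweet:

library(jsonlite)  # assumed source of fromJSON
library(rtweet)    # assumed source of write_as_csv

process_file <- function(f) {
  data <- fromJSON(f)
  data <- data[, c(1:3, 13, 14)]
  data$lang <- ifelse(data$lang != "de", NA, data$lang)
  data <- na.omit(data)
  write_as_csv(data, paste0(f, ".csv"))
}

invisible(lapply(dir(pattern = "*.json"), process_file))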
I'm trying to get data from the Lithuanian Statistics Department. They offer an SDMX API with either XML or JSON (LSD).
The example XML shown is https://osp-rs.stat.gov.lt/rest_xml/data/S3R629_M3010217, which downloads the XML file.
I tried following:
devtools::install_github("opensdmx/rsdmx")
library(rsdmx)
string <- "https://osp-rs.stat.gov.lt/rest_xml/data/S3R629_M3010217"
medianage <- readSDMX(string)
which results in error:
<simpleError in doTryCatch(return(expr), name, parentenv, handler): Invalid SDMX-ML file>
I also tried simply reading in the manually downloaded file
devtools::install_github("opensdmx/rsdmx")
library(rsdmx)
medianage <- readSDMX(file = "rest_data_M3010217_20180116163251.xml", isURL = FALSE)
medianage <- as.data.frame(medianage)
which results in medianage being NULL (empty).
Maybe someone has an idea how I could solve downloading / transforming the data from LSD by using either:
https://osp-rs.stat.gov.lt/rest_xml/data/S3R629_M3010217
https://osp-rs.stat.gov.lt/rest_json/data/S3R629_M3010217
Thanks a lot!
In order to use rsdmx for this data source, some enhancements have been added (see details at https://github.com/opensdmx/rsdmx/issues/141). You will need to re-install rsdmx from GitHub (version 0.5-11).
You can use the URL of the SDMX-ML file:
library(rsdmx)
url <- "https://osp-rs.stat.gov.lt/rest_xml/data/S3R629_M3010217"
medianage <- readSDMX(url)
df <- as.data.frame(medianage)
A connector has been added in rsdmx to facilitate data queries on the LSD (Lithuanian Statistics Department) SDMX endpoint. See below an example of how to use it.
sdmx <- readSDMX(providerId = "LSD", resource = "data",
flowRef = "S3R629_M3010217", dsd = TRUE)
df <- as.data.frame(sdmx, labels = TRUE)
The above example shows how to enrich the data.frame with code labels extracted from the SDMX Data Structure Definition (DSD). For this, specify dsd = TRUE in readSDMX; you can then use labels = TRUE when converting to a data.frame. For filtering data with readSDMX (e.g. startPeriod, endPeriod, code filters), check this page: https://github.com/opensdmx/rsdmx/wiki#readsdmx-as-helper-function
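For instance, a hedged sketch of a period-filtered query might look like this (the start/end arguments follow the readSDMX helper documentation; adjust the years to your needs):

library(rsdmx)

# start / end restrict the reporting period of the query
sdmx_sub <- readSDMX(providerId = "LSD", resource = "data",
                     flowRef = "S3R629_M3010217",
                     start = 2010, end = 2017, dsd = TRUE)
df_sub <- as.data.frame(sdmx_sub, labels = TRUE)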
I'm trying to scrape data from Transfermarkt, mainly using the XML and httr packages.
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading the page, there is a hidden array named 'series' in the source:
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download this data directly? I want to scrape 600+ pages.
Until now, I have tried:
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can then use conventional XPath or CSS selectors to find the script elements. However, finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest) # Better suited for web scraping than httr & xml.
library(rjson)
doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')
script = doc %>%
html_nodes('script') %>%
html_text() %>%
grep(pattern = "'series':", value = TRUE)
# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))
# Truncate characters from the end until the result is parseable as valid JSON …
while (nchar(json_content) > 0) {
json = try(fromJSON(json_content), silent = TRUE)
if (! inherits(json, 'try-error'))
break
json_content = substr(json_content, 1, nchar(json_content) - 1)
}
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.
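For completeness, a minimal V8 sketch is shown below; the fragment is a simplified, hypothetical stand-in for the real 'series' assignment, which sits inside a much larger Highcharts config:

library(V8)

ctx <- v8()
# Evaluate a JavaScript assignment, then pull the object back into R;
# V8 converts it through jsonlite, so an array of objects comes back as a data frame
ctx$eval("var series = [{y: 600000, verein: 'CF América', age: 21},
                        {y: 850000, verein: 'Jaguares de Chiapas', age: 21}]")
ctx$get("series")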
Is there a way to easily export R tables to a simple HTML page?
The xtable function in the xtable package can export R tables to HTML tables. This blog entry describes how you can create HTML pages from Sweave documents.
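A minimal xtable sketch (the file name is just an example):

library(xtable)

# print.xtable with type = "html" writes an HTML <table> to the given file
print(xtable(head(mtcars)), type = "html", file = "mtcars.html")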
It might be worth mentioning a newer package specifically designed to convert (and style with CSS) data.frames (or tables) into HTML tables in an easy and intuitive way: tableHTML. You can see a simple example below:
library(tableHTML)
# create an HTML table
tableHTML(mtcars)

# and export it to a file
write_tableHTML(tableHTML(mtcars), file = 'myfile.html')
You can see a detailed tutorial here as well.
Apart from xtable, mentioned by @nullglob, there are three more packages that might come in handy here (a short R2HTML sketch follows the list):
R2HTML
HTMLUtils
hwriter
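For example, a minimal R2HTML sketch (the file name is just a placeholder):

library(R2HTML)

# HTML() appends an HTML rendering of the data frame to the target file
HTML(head(mtcars), file = "mtcars_r2html.html")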
The grammar of tables package gt is also an option.
Here's the example from the docs for generating an HTML table:
library(gt)
tab_html <-
gtcars %>%
dplyr::select(mfr, model, msrp) %>%
dplyr::slice(1:5) %>%
gt() %>%
tab_header(
title = md("Data listing from **gtcars**"),
subtitle = md("`gtcars` is an R dataset")
) %>%
as_raw_html()
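Since as_raw_html() returns the table as an HTML string, you could, assuming you want a standalone file, simply write it to disk:

# tab_html is the character string of raw HTML produced above
writeLines(tab_html, "gtcars_table.html")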
In an issue for the DT package, someone posted how to use DT to render HTML inside table cells. I've pasted the relevant example code, modified to apply the render function to all columns with targets = "_all".
library(DT)
render <- c(
"function(data, type, row){",
" if(type === 'sort'){",
" var parser = new DOMParser();",
" var doc = parser.parseFromString(data, 'text/html');",
" data = doc.querySelector('a').innerText;",
" }",
" return data;",
"}"
)
dat <- data.frame(
a = c("AAA", "BBB", "CCC"),
b = c(
# hypothetical placeholder links; the b column is meant to hold raw HTML
# (hence escape = FALSE and the render function that reads the <a> text)
'<a href="https://example.com/a">aaaaa</a>',
'<a href="https://example.com/b">bbbbb</a>',
'<a href="https://example.com/j">jjjjj</a>'
)
)
datatable(
dat,
escape = FALSE,
options = list(
columnDefs = list(
list(targets = "_all", render = JS(render))
)
)
)
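If you need the widget as a standalone page (an assumption about the goal; inside RStudio or a Shiny app it simply renders), htmlwidgets can save it:

library(htmlwidgets)

# dt is the result of the datatable() call above
dt <- datatable(dat, escape = FALSE,
                options = list(columnDefs = list(
                  list(targets = "_all", render = JS(render)))))
saveWidget(dt, "dat_table.html")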
I hope this helps.