I'm trying to scrape data from transfermarkt, using mainly the XML and httr packages.
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading, the page source contains a hidden array named 'series':
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download this data directly? I want to scrape 600+ pages.
So far, I have tried:
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can use conventional XPath or CSS selectors to find the script elements, but finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest) # Better suited for web scraping than httr & xml.
library(rjson)
doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')
script = doc %>%
    html_nodes('script') %>%
    html_text() %>%
    grep(pattern = "'series':", value = TRUE)
# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))
# Truncate characters from the end until the result is parseable as valid JSON …
while (nchar(json_content) > 0) {
    json = try(fromJSON(json_content), silent = TRUE)
    if (! inherits(json, 'try-error'))
        break
    json_content = substr(json_content, 1, nchar(json_content) - 1)
}
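If the loop succeeds, json holds the parsed series array, and the individual market-value points can then be pulled out, e.g. (assuming the structure shown above):
# Each element of `data` is one point on the market-value curve
points <- json[[1]]$data
points[[1]]$y         # market value, e.g. 600000
points[[1]]$datum_mw  # date of that valuation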
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.
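For illustration, a minimal sketch of the V8 route, assuming the fragment can be made self-contained; the jQuery stub below is my own guess at what the page's script needs and will vary per page:
library(V8)
ctx <- v8()
# Stub jQuery so the chart-setup call merely captures its configuration object
ctx$eval('var captured; var $ = function() { return { highcharts: function(cfg) { captured = cfg; } }; };')
ctx$eval(script)  # `script` is the <script> text extracted above
series <- ctx$get("captured")$series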
I am currently trying to scrape the data from the two graphs on the following HTML page (the two graphs listed there: Forsmark and Ringhals): https://group.vattenfall.com/se/var-verksamhet/vara-energislag/karnkraft/aktuell-karnkraftsproduktion
The data originate from script tags like this (fragment):
<script type="text/javascript">
/*<![CDATA[*/ productionData = JSON.parse("{\"timestamp\":1582642616000,\"powerPlant\":\"Ringhals\", // etc
</script>
I would like to get two dataframes that look like these:
F1 F2 F3
number number number
and
R1 R2 R3
number number number
I tried to use XML and XPath to parse the HTML page but did not get anywhere with that.
Do you have any ideas?
Thanks!
Those charts are <iframe>s that load from
https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark and
https://gvp.vattenfall.com/sweden/produced-power/iframe/ringhals
so you should scrape those two pages directly.
This was an interesting challenge.
It turns out not to be too hard with rvest and jsonlite, which you will have to install if you don't already have them (both require Rtools if built from source).
Try this:
library('rvest')
library('jsonlite')
# Load the URL (do the same for the other iframe)
url <- 'https://gvp.vattenfall.com/sweden/produced-power/iframe/forsmark'
# Parse it
webpage <- read_html(url)
# Extract the script element. That's a CSS selector for the specific one that holds the json data
# You can find it in your browser's DevTools by finding the script element
# and right-clicking, choosing Copy > CSS Path/Selector
script_element <- html_nodes(webpage, 'body > section:nth-child(2) > script:nth-child(2)')
# Extract its string content
json = html_text(script_element)
# Clean it up
json = gsub("\n /*<![CDATA[*/\n productionData = JSON.parse(", "", json, fixed=TRUE)
json = gsub(");\n /*]]>*/\n ", "", json, fixed=TRUE)
json = gsub("\"{", "{\"", json, fixed=TRUE)
json = gsub("}\"", "}", json, fixed=TRUE)
json = gsub("{\"\\\"", "{\\\"", json, fixed=TRUE)
# Extract data
data = jsonlite::fromJSON(gsub("\\\"", "\"", json, fixed=TRUE))
Caveat: I'm not really an R expert; there is likely a more elegant way of doing this (particularly the data-cleaning portion), but it works.
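For what it's worth, a shorter cleanup might capture the JSON.parse() argument with a single regex and unescape it before parsing. A sketch, not tested against every revision of the page:
raw <- html_text(script_element)
# (?s) lets '.' span newlines; the capture group is the escaped JSON string
payload <- sub('(?s).*JSON\\.parse\\("(.*)"\\).*', '\\1', raw, perl = TRUE)
data <- jsonlite::fromJSON(gsub('\\\\"', '"', payload))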
For historical preservation, that takes this DOM node (the text content of the <script> tag):
"\n /*<![CDATA[*/\n productionData = JSON.parse(\"{\\\"timestamp\\\":1582643336000,\\\"powerPlant\\\":\\\"Forsmark\\\",\\\"blockProductionDataList\\\":[{\\\"name\\\":\\\"F1\\\",\\\"production\\\":998.86194,\\\"percent\\\":99.88619},{\\\"name\\\":\\\"F2\\\",\\\"production\\\":1120.434,\\\"percent\\\":97.8545},{\\\"name\\\":\\\"F3\\\",\\\"production\\\":1189.7126,\\\"percent\\\":99.55754}]}\");\n /*]]>*/\n "
and will result in data of this format
> data
$timestamp
[1] 1.582647e+12
$powerPlant
[1] "Forsmark"
$blockProductionDataList
name production percent
1 F1 997.7902 99.77902
2 F2 1131.6150 98.83100
3 F3 1190.0520 99.58594
I was provided with a list of identifiers (in this case the identifier is called an NPI). These identifiers can be copied and pasted into this website (https://npiregistry.cms.hhs.gov/registry/?). I want to return the NPI number, the physician's name, address, phone number, and specialty.
I have over 3,000 identifiers so a copy and paste is not efficient and not easily repeatable for future use.
If possible, I would like to create a list of URLs, pass them into a function, and receive a dataframe that provides me with the variables mentioned above (NPI, NAME, ADDRESS, PHONE, SPECIALTY).
I was able to write a function that produces the URLs needed:
Here are some NPI numbers for reference: 1417024746, 1386790517, 1518101096, 1255500625.
This is my code for reading in the file that contains my NPIs
npiList <- c("1417024746", "1386790517", "1518101096", "1255500625")
npiList <- as.list(npiList)
npiList <- unlist(npiList, use.names = FALSE)
This is the function to return the list of URLs:
npiaddress <- function(x){
  url <- paste("https://npiregistry.cms.hhs.gov/registry/search-results-table?number=",
               x, "&addressType=ANY", sep = "")
  return(url)
}
I saved the list to a variable and perhaps this is my downfall:
npi_urls <- npiaddress(npiList)
From here I wrote a function that can accept a single URL, retrieves the data I want and turns it into a dataframe. My issue is that I cannot pass multiple URLs:
npiLookup <- function(x){
  url <- x
  webpage <- read_html(url)
  npi_html <- html_nodes(webpage, "td")
  npi <- html_text(npi_html)
  npi[4] <- gsub("\r?\n|\r", " ", npi[4])
  npi[4] <- gsub("\r?\t|\r", " ", npi[4])
  npiFinal <- npi[c(1:2, 4:6)]
  npiFinal <- as.data.frame(npiFinal)
  npiFinal <- t(npiFinal)
  npiFinal <- as.data.frame(npiFinal)
  names(npiFinal) <- c("NPI", "NAME", "ADDRESS", "PHONE", "SPECIALTY")
  return(npiFinal)
}
For example:
If I wanted to get a dataframe for the following identifier (1417024746), I can run this and it works:
x <- npiLookup("https://npiregistry.cms.hhs.gov/registry/search-results-table?number=1417024746&addressType=ANY")
View(x)
My output for the example returns the NPI, NAME, ADDRESS, PHONE, SPECIALTY as desired, but again, I need to do this for several thousand NPI identifiers. I feel like I need a loop within npiLookup. I've also tried to put npi_urls into the npiLookup function but it does not work.
Thank you for any help and for taking the time to read.
You're most of the way there. The final step uses this useful R idiom:
do.call(rbind, lapply(npiList, function(npi) {
  url <- npiaddress(npi)
  npiLookup(url)
}))
do.call is a base R function that applies a function (in this case rbind) to the list produced by lapply. That list is the result of running your npiLookup function on the url produced by your npiaddress for each element of npiList.
A few further comments for future reference should anyone else come upon this question: (1) I don't know why you're doing the as.list, unlist sequence at the beginning; it's redundant and probably unnecessary. (2) The NPI registry provides a programming interface (API) that avoids the need to scrape data from the HTML pages; this might be more robust in the long run. (3) The NPI registry provides the entire dataset as a downloadable file; this might have been an easier way to go.
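On point (2), a sketch of what the API route could look like (endpoint and parameter names as I recall the NPPES API; verify against its current documentation):
library(httr)
library(jsonlite)

npi_api <- function(npi) {
  res <- GET("https://npiregistry.cms.hhs.gov/api/",
             query = list(version = "2.1", number = npi))
  stop_for_status(res)
  fromJSON(content(res, as = "text", encoding = "UTF-8"))
}

result <- npi_api("1417024746")
# result$results should hold the provider record(s): name, addresses, taxonomies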
Good morning.
I want to use the following REST endpoint: https://rest.ensembl.org/documentation/info/sequence_id_post
I have the vector object (ids) in R:
> ids
[1] "NM_007294.3:c.932_933insT" "NM_007294.3:c.1883C>T" "NM_007294.3:c.2183A>C"
[4] "NM_007294.3:c.2321C>T" "NM_007294.3:c.4585G>A" "NM_007294.3:c.4681C>A"
I have to put this vector (ids), which has more than 200 entries, into the request body (the body = argument below), following the example code, for it to work:
Code:
library(httr)
library(jsonlite)
library(xml2)
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""),
          content_type("application/json"),
          accept("application/json"),
          body = '{ "hgvs_notations" : ["NM_007294.3:c.932_933insT", "NM_007294.3:c.1883C>T"] }')
stop_for_status(r)
head(fromJSON(toJSON(content(r))))
I know the body has to be in JSON format, but when I convert my ids variable to JSON it's not in the correct format.
Do you have any suggestions?
Thanks for any help.
Leandro
I think that NM_007294.3:c.2321C>T is not a valid query to the /sequence/id REST endpoint. It contains a sequence id (NM_007294.3) and a variant (c.2321C>T), and if taken literally you are asking the server for the letter T, since this call returns sequences.
A valid query would contain only sequence ids, and you can use it like this (provided you have your ids in a vector):
r <- POST(paste(server, ext, sep = ""),
          content_type("application/json"),
          accept("application/json"),
          body = paste0('{ "ids" : ', jsonlite::toJSON(ids), ' }'))
Depending on the downstream scenario, making your ids unique might help or speed things up.
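Equivalently, you could let jsonlite build the whole body and avoid the manual pasting (a sketch along the same lines):
body <- jsonlite::toJSON(list(ids = ids))  # yields {"ids":["...","..."]}
r <- POST(paste0(server, ext),
          content_type("application/json"),
          accept("application/json"),
          body = body)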
What's the preferred way in R to convert a character (vector) containing non-ASCII characters to HTML? I would, for example, like to convert
"ü"
to
"ü"
I am aware that this is possible by a clever use of gsub (but has anyone done it once and for all?), and I thought that the package R2HTML would do it, but it doesn't.
EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:
char2html <- function(x){
  dictionary <- data.frame(
    symbol = c("ä", "ö", "ü", "Ä", "Ö", "Ü", "ß"),
    html = c("&auml;", "&ouml;", "&uuml;", "&Auml;", "&Ouml;", "&Uuml;", "&szlig;"))
  for(i in 1:dim(dictionary)[1]){
    x <- gsub(dictionary$symbol[i], dictionary$html[i], x)
  }
  x
}
x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
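which yields:
[1] "Buschwindr&ouml;schen" "Wei&szlig;dorn"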
This question is pretty old, but I couldn't find any straightforward answer... So I came up with this simple function, which uses numerical HTML codes and works for the Latin-1 Supplement range (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications...
conv_latinsupp <- function(...) {
  out <- character()
  for (s in list(...)) {
    splitted <- unlist(strsplit(s, ""))
    intvalues <- utf8ToInt(enc2utf8(s))
    pos_to_modify <- which(intvalues >= 161 & intvalues <= 255)
    splitted[pos_to_modify] <- paste0("&#", intvalues[pos_to_modify], ";")
    out <- c(out, paste0(splitted, collapse = ""))
  }
  out
}
conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou" "àéïôù12345"
The XML package has a method insertEntities for this, but that method is internal. So you may use it at your own risk, as there is no guarantee that it will continue to work like this in future versions.
Right now, your code could be accomplished using
char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))
The use of a named vector instead of a data.frame feels kind of elegant, but doesn't change the core of things. Under the hood, insertEntities calls gsub in much the same way your code does.
If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn safely printable ASCII characters back into unescaped form. This would save you the trouble of maintaining a dictionary for your entities.
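For illustration, a minimal sketch of that numeric-entity route (the function name is mine): escape every character first, then restore the printable ASCII range:
char2html_numeric <- function(x) {
  vapply(x, function(s) {
    codes <- utf8ToInt(enc2utf8(s))
    out <- paste0("&#", codes, ";")
    ascii <- codes >= 32 & codes <= 126  # printable ASCII stays literal
    out[ascii] <- intToUtf8(codes[ascii], multiple = TRUE)
    paste0(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}
char2html_numeric("Weißdorn")
## [1] "Wei&#223;dorn"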
This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.
There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.
In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].
If one has a collection of these data structures that may not be entirely homogeneous the key to doing a successful flattening to a dataframe would be to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.
This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.
Hi @Sim, I had cause to reflect on your problem yesterday. Define:
flatten <- function(x) {
  dumnames <- unlist(getnames(x, TRUE))
  dumnames <- gsub("(*.)\\.1", "\\1", dumnames)
  repeat {
    x <- do.call(.Primitive("c"), x)
    if (!any(vapply(x, is.list, logical(1)))) {
      names(x) <- dumnames
      return(x)
    }
  }
}
getnames <- function(x, recursive) {
  nametree <- function(x, parent_name, depth) {
    if (length(x) == 0)
      return(character(0))
    x_names <- names(x)
    if (is.null(x_names)) {
      x_names <- seq_along(x)
      x_names <- paste(parent_name, x_names, sep = "")
    } else {
      x_names[x_names == ""] <- seq_along(x)[x_names == ""]
      x_names <- paste(parent_name, x_names, sep = "")
    }
    if (!is.list(x) || (!recursive && depth >= 1L))
      return(x_names)
    x_names <- paste(x_names, ".", sep = "")
    lapply(seq_len(length(x)), function(i) nametree(x[[i]], x_names[i], depth + 1L))
  }
  nametree(x, "", 0L)
}
(getnames is adapted from AnnotationDbi:::make.name.tree)
(flatten is adapted from discussion here How to flatten a list to a list without coercion?)
as a simple example
my_data<-list(x=list(1,list(1,2,y='e'),3))
> my_data[['x']][[2]][['y']]
[1] "e"
> out<-flatten(my_data)
> out
$x.1
[1] 1
$x.2.1
[1] 1
$x.2.2
[1] 2
$x.2.y
[1] "e"
$x.3
[1] 3
> out[['x.2.y']]
[1] "e"
So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.
A more complicated example
library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)
UPDATE
A naive way to remove the trailing .1:
my_data <- list(x = list(1, list(1, 2, y = 'e'), 3))
> gsub("(*.)\\.1", "\\1", unlist(getnames(my_data, TRUE)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"
R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.
You can then flatten that list (into a vector) using the unlist function.
If the list is suitably structured (a non-nested list where each element is the same length), then as.data.frame provides an alternative way to convert the list into a data frame.
An example:
(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)
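which, for this example, returns a named vector whose names encode the path to each leaf:
  x.1 x.2.y 
    1     2 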
The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example json data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.
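For instance, fromJSON can return an already-flattened data frame from a JSON array of objects (inline JSON here purely for illustration):
library(jsonlite)
fromJSON('[{"a": 1, "b": {"c": 2}}, {"a": 3, "b": {"c": 4}}]', flatten = TRUE)
##   a b.c
## 1 1   2
## 2 3   4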
Great answer with the flatten and getnames functions. It took a few minutes to figure out all the options needed to get from a vector of JSON strings to a data.frame, so I thought I'd record that here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (a data.table) with one row per string, where each column corresponds to a different possible leaf node of the JSON tree. Any string missing a particular leaf node is filled with NA.
library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) # flatten from the accepted answer, not jsonlite::flatten
d = rbindlist(flattened, fill=TRUE)
I'm now a big fan of simply:
library(jsonlite)
library(tidyverse)
fromJSON("file_path.json") %>%
unlist() %>%
enframe()
And then potentially, depending on your data, piping that into
%>%
pivot_wider()
Once it's in a flat table shape, there are plenty of tools in the tidyverse and other R libraries for wrangling things around, e.g. dealing with columns that share a prefix (which this pipeline will produce, since the parent name of a nested JSON chunk is prefixed to each child's name).
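A minimal end-to-end sketch (inline JSON in place of a file path):
library(jsonlite)
library(tidyverse)
'{"x": {"y": 1, "z": 2}}' %>%
  fromJSON() %>%
  unlist() %>%
  enframe() %>%
  pivot_wider(names_from = name, values_from = value)
## # A tibble: 1 x 2
##     x.y   x.z
##   <dbl> <dbl>
## 1     1     2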