For Loop reading API response with existing Data Frame - json

I have a dataframe:
df
NAME ARTISTNAME COL3
1 Everything_Now (continued) Arcade Fire Everything_Now%20(continued)%20Arcade%20Fire
2 Everything Now Arcade Fire Everything%20Now%20Arcade%20Fire
3 Signs of Life Arcade Fire Signs%20of%20Life%20Arcade%20Fire
4 Creature Comfort Arcade Fire Creature%20Comfort%20Arcade%20Fire
5 Peter Pan Arcade Fire Peter%20Pan%20Arcade%20Fire
6 Chemistry Arcade Fire Chemistry%20Arcade%20Fire
My goal is to loop over this with the Genius API to get the lyrics URL for each value in COL3.
If I were to not loop this and just do it for each song individually, then my output for one would look like this:
genius_url <- "https://api.genius.com/search?q=Everything_Now%20(continued)%20Arcade%20Fire"
getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
answer <- data.frame(geniuslyrics$response$hits$result$url[1])
answer
X.https...genius.com.Arcade.fire.everything.now.continued.lyrics.
1 https://genius.com/Arcade-fire-everything-now-continued-lyrics
str(answer)
'data.frame': 1 obs. of 1 variable:
$ X.https...genius.com.Arcade.fire.everything.now.continued.lyrics.: Factor w/ 1 level "https://genius.com/Arcade-fire-everything-now-continued-lyrics": 1
This was my attempt at the for-loop so far but I am getting an error:
for(i in 1:length(df[,3])) {
genius_url <- paste("https://api.genius.com/search?q=",
df3[i,3],
sep="")
getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
answer <- data.frame(geniuslyrics$response$hits$result$url[1])
df[i,4] <- answer[1,]
}
The error message I am getting is:
Error in x[...] <- m : replacement has length zero
In addition: There were 26 warnings (use warnings() to see them)
Hope this makes sense. Any help would be great, thanks.

Does your data frame already have the third column, or do you need to create it from columns 1 and 2? I assumed you have to create the third column from the first and the second.
Try rewriting the single request as a function:
funfun <- function(...){
x=unlist(list(...))
A=paste(unlist(lapply(x,strsplit," ")),collapse = "%20")
genius_url=paste0("https://api.genius.com/search?q=",A)
getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
answer <- data.frame(geniuslyrics$response$hits$result$url[1])
answer
}
Now, from here you can loop or use the apply functions:
apply(df[,1:2],1,funfun)
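As an aside, base R's utils::URLencode can do the percent-encoding instead of splitting on spaces and pasting with %20. This is only a minimal sketch, assuming the song and artist values are plain character strings; build_query is a hypothetical helper name, not something from the question:

```r
# Hypothetical helper: join song and artist, then percent-encode the result
build_query <- function(name, artist) {
  utils::URLencode(paste(name, artist))
}

query <- build_query("Everything Now", "Arcade Fire")
genius_url <- paste0("https://api.genius.com/search?q=", query)
genius_url
```

The resulting string can be passed to GET() in the same way as the hand-built URL.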
If you already have the third column, then your life is easier:
funfun_1 <- function(x){
genius_url=paste0("https://api.genius.com/search?q=",x)
getgeniuslyrics <- GET(genius_url, add_headers(Authorization = HeaderValue))
geniuslyrics <- jsonlite::fromJSON(toJSON(content(getgeniuslyrics)))
answer <- data.frame(geniuslyrics$response$hits$result$url[1])
answer
}
sapply(df[,3],funfun_1)
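A note on the error itself: "replacement has length zero" is what R reports when the right-hand side of an assignment is empty, which would happen here if a search returned no hits. It may also be worth checking that the loop reads from df rather than df3. The sketch below shows one way to guard against empty results; first_url_or_na is a hypothetical helper, and the two lists are mock data standing in for a parsed response, not real API output:

```r
# Hypothetical helper: pull the first hit URL out of a parsed response,
# returning NA instead of a zero-length value when there are no hits.
first_url_or_na <- function(parsed) {
  urls <- unlist(parsed$response$hits$result$url)
  if (length(urls) == 0) NA_character_ else as.character(urls[1])
}

# Mock parsed responses (made up for illustration)
with_hit <- list(response = list(hits = list(result = list(
  url = c("https://example.com/a", "https://example.com/b")
))))
no_hit <- list(response = list(hits = list()))

first_url_or_na(with_hit)
first_url_or_na(no_hit)
```

Inside the loop, df[i, 4] <- first_url_or_na(geniuslyrics) would then always assign a value of length one.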

Related

R web scraping with Rselenium and rvest

I need to scrape this webpage so I can build a data.frame like this:
value01 value02 id
SECTION I LIVE ANIMALS ANIMAL PRODUCTS sectionI
CHAPTER 1 LIVE ANIMALS chap0100000000
0101 Live horses, asses, mules and hinnies : (TN701) 0101000000-1
- Horses : 0101210000-2
0101 21 - - Pure-bred breeding animals (NC018) 0101210000-80
0101 29 - - Other : 0101290000-3
0101 29 10 - - - For slaughter 0101291000-80
0101 29 90 - - - Other 0101299000-80
0101 30 - Asses 0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the rest of values of value01 and value02 I use (I need to clean the obtained values after I got them with this code, but I think there is better way to obtain the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code you need to run for the previous examples to work:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')
It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table , .chapter') |>
rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1])
You'll need to clean the scraped data to your liking.
If this is not what you are looking for, you should clarify your question further.
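To put the pieces together into one data.frame, something along these lines might work. It is only a sketch: the HTML below is a made-up stand-in, and the assumption that each .evenLine element carries its id as an attribute and its two values in child span elements is hypothetical and would need checking against the real page source:

```r
library(rvest)

# Made-up markup standing in for the real page (structure is an assumption)
page <- minimal_html('
  <div id="0101210000-80" class="evenLine"><span>0101 21</span><span>Pure-bred breeding animals</span></div>
  <div id="0101291000-80" class="evenLine"><span>0101 29 10</span><span>For slaughter</span></div>
')

lines <- html_elements(page, ".evenLine")

# One row per line: two text columns plus the id attribute
result <- data.frame(
  value01 = html_text2(html_element(lines, "span:nth-child(1)")),
  value02 = html_text2(html_element(lines, "span:nth-child(2)")),
  id = html_attr(lines, "id"),
  stringsAsFactors = FALSE
)
result
```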

How do I convert a Tibble to HTML Table in R tidyverse?

I'm looking for a way to convert the results of a pipeline manipulation into a table so it can be rendered as an HTML table in R Markdown.
Sample data:
Category <- sample(1:6, 394400, replace=TRUE)
Category <- factor(Category,
levels = c(1,2,3,4,5,6),
labels = c("First",
"Second",
"Third",
"Fourth",
"Fifth",
"Sixth"))
data <- data.frame(Category)
Then I build a frequency table using the pipeline:
library(dplyr)
Table <- data %>%
group_by(Category) %>%
summarise(N= n(), Percent = n()/NROW(data)*100) %>%
mutate(C.Percent = cumsum(Percent))
Which gives me this nice little summary table here:
# A tibble: 6 × 4
Category N Percent C.Percent
<fctr> <int> <dbl> <dbl>
1 First 65853 16.69701 16.69701
2 Second 66208 16.78702 33.48403
3 Third 65730 16.66582 50.14985
4 Fourth 65480 16.60243 66.75228
5 Fifth 65674 16.65162 83.40390
6 Sixth 65455 16.59610 100.00000
However if I try to convert that to a table to then convert to HTML, it tells me it cannot coerce Table to a table. This is the same with data frames as well.
Does anyone know a way, as I'd quite like to customise the appearance of the output?
There are several packages for that. Here are some:
knitr::kable(Table)
htmlTable::htmlTable(Table)
ztable::ztable(as.data.frame(Table))
DT::datatable(Table)
stargazer::stargazer(Table, type = "html")
Each of these has different customization options.
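As a small illustration of the first option, the sketch below renders a data frame to HTML with knitr::kable. The two-row table is a stand-in that copies values from the question's output, and the class name "freq-table" is made up for the example:

```r
library(knitr)

# Small stand-in for the summary table (values copied from the question)
Table <- data.frame(
  Category = c("First", "Second"),
  N = c(65853L, 66208L),
  Percent = c(16.69701, 16.78702)
)

# format = "html" returns the table markup; table.attr adds attributes
# to the <table> tag so the table can be styled with CSS
html <- kable(Table, format = "html", digits = 2,
              table.attr = "class=\"freq-table\"")
cat(html)
```

In an R Markdown document rendered to HTML, calling kable(Table) in a chunk is usually enough; the explicit format argument is mainly useful outside knitr.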

Flatten deep nested json in R

I am trying to use R to convert a nested JSON file into a two dimensional dataframe.
My JSON file has a nested structure. But, the names and properties are the same across levels.
{"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}
The desired dataset would look like this. Although the exact column names can be different.
name  value  c_name  c_value  c_c_name  c_c_value
A     1      a1      11       a11       111
A     1      a1      11       a12       112
A     1      a2      12
The code I have so far flattens the data, but it only seems to work for the first level (see the screenshot of the output).
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
data <- fromJSON(json_file, flatten = TRUE)
View(data)
I tried multiple packages, including jsonlite and RJSONIO, and I spent the last 5 hours debugging this and trying various online tutorials, but without success. Thanks for your help!
Firstly, that is some ugly JSON; if you have a way of avoiding it, do so. Consequently, what follows is also pretty ugly—to the degree that I normally wouldn't post it, but I am doing so now in the hope that some of the approaches may be of use. If it offends your eyes, let me know and I'll delete it.
library(jsonlite) # for fromJSON
library(reshape2) # for melt
library(dplyr) # for inner_join, select
jlist <- fromJSON(json_file)
jdf <- as.data.frame(jlist)
jdf$c.value <- as.numeric(jdf$c.value) # fix type
jdf$L1 <- as.integer(factor(jdf$c.name)) # for use as a key with an artifact of melt later *urg, sorry*
ccdf <- melt(jdf$c.c) # get nested list into usable form
names(ccdf)[1:2] <- c('c.c.name', 'c.c.value') # fix names so they won't cause problems with the join
df3 <- inner_join(jdf[, -5], ccdf) # join, take out nested column
df3$c.c.value <- as.numeric(df3$c.c.value) # fix type
df3 <- df3 %>% select(-L1, -c) # get rid of useless columns
which leaves you with
> df3
name value c.name c.value c.c.name c.c.value
1 A 1 a1 11 a11 111
2 A 1 a1 11 a12 112
3 A 1 a2 12 <NA> NA
with reasonably sensible types. The packages used are avoidable, if you like.
Is this scalable? Well, not really, without more of the same mess. If anybody else has a less nasty and more scalable approach for dealing with nasty JSON, please post it; I'd be as grateful as the OP.
I think I figured out a way to do this. It seems to work with larger trees. The idea is to unlist the JSON and use the names attribute of the unlisted elements. In this example, if a node has one parent, the name attribute will start with "c."; if it has a parent and a "grandparent", it will start with "c.c.", and so on. So, the code below uses this structure to find the level of nesting and place the node in the appropriate columns. The rest of the code adds the attributes of the parent nodes and deletes the extra rows generated. I know it is not elegant, but I thought it might be useful for others.
library(stringr)
library(jsonlite)
json_file <- ' {"name":"A", "value":"1", "c":
[{"name":"a1", "value":"11", "c":
[{"name":"a11", "value":"111"},
{"name":"a12", "value":"112"}]
},
{"name":"a2", "value":"12"}]
}'
nestedjson <- fromJSON(json_file, simplifyVector = F) #read the json
nAttrPerNode <- 2 #number of attributes per node
strChild <- "c." #determines level of nesting
unnestedjson <- unlist(nestedjson) #convert JSON to unlist
unnestednames <- attr(unnestedjson, "names") #get the names of the cells
depthTree <- (max(str_count(unnestednames, strChild)) + 1) * nAttrPerNode #maximum tree depth
htTree <- length(unnestednames) / nAttrPerNode #maximum tree height (number of branches)
X <- array("", c(htTree, depthTree))
for (nodeht in 1:htTree){ #iterate through the branches and place the nodes based on the count of strChild in the name attribute
nodeIndex <- nodeht * nAttrPerNode
nodedepth <- str_count(unnestednames[nodeIndex], strChild) + 1
X[nodeht, nodedepth * nAttrPerNode - 1] <- unnestedjson[nodeIndex - 1]
X[nodeht, nodedepth * nAttrPerNode] <- unnestedjson[nodeIndex]
}
for (nodeht in 2:htTree){ #repeat the parent node attributes for the children
nodedepth <- 0
repeat{
nodedepth <- nodedepth + 1
startcol <- nodedepth * nAttrPerNode - 1
endcol <- startcol + nAttrPerNode - 1
if (X[nodeht, startcol] == "" & nodedepth < depthTree/nAttrPerNode){
X[nodeht, startcol:endcol] <- X[nodeht-1, startcol:endcol]
} else {
break()
}
}
}
deleteRows <- NULL #Finally delete the rows that only have the parent attributes for nodes that have children
strBranches <- apply(X, 1, paste, collapse="")
for (nodeht in 1:(htTree-1)){
branch2sub <- substr(strBranches[nodeht+1], 1, nchar(strBranches[nodeht]))
if (strBranches[nodeht]==branch2sub){
deleteRows <- c(deleteRows, nodeht)
}
}
deleteRows
X <- X[-deleteRows,]
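Another way to approach the same problem is plain recursion: walk the tree and emit one row per leaf, carrying the ancestors' attributes along. The sketch below assumes every node has exactly the fields "name" and "value" plus an optional child array "c", as in the question; leaf_paths is a hypothetical helper name:

```r
library(jsonlite)

json_file <- '{"name":"A", "value":"1", "c":
  [{"name":"a1", "value":"11", "c":
     [{"name":"a11", "value":"111"},
      {"name":"a12", "value":"112"}]},
   {"name":"a2", "value":"12"}]}'

# Walk the tree depth-first; return one character vector per leaf,
# holding the name/value pairs from the root down to that leaf.
leaf_paths <- function(node, prefix = character(0)) {
  path <- c(prefix, node[["name"]], node[["value"]])
  children <- node[["c"]]
  if (length(children) == 0) return(list(path))
  do.call(c, lapply(children, leaf_paths, prefix = path))
}

tree <- fromJSON(json_file, simplifyVector = FALSE)
paths <- leaf_paths(tree)

# Pad shorter paths with NA so every row has the same width
width <- max(lengths(paths))
flat <- as.data.frame(
  do.call(rbind, lapply(paths, function(p) c(p, rep(NA, width - length(p))))),
  stringsAsFactors = FALSE
)

# Name the columns name, value, c.name, c.value, c.c.name, ...
depth <- width / 2
names(flat) <- paste0(rep(strrep("c.", 0:(depth - 1)), each = 2),
                      c("name", "value"))
flat
```

All columns come back as character, so numeric fields would still need converting afterwards.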

non-conformable arguments error from lmer when trying to extract information from the model matrix

I have some longitudinal data from which I'd like to get the predicted means at specified times. The model includes 2 terms, their interaction and a spline term for the time variable. When I try to obtain the predicted means, I get "Error in mm %*% fixef(m4) : non-conformable arguments"
I've used the sleepstudy data set from lme4 to illustrate my problem. First, I import the data and create a variable "age" for my interaction.
sleep <- as.data.frame(sleepstudy) #get the sleep data
# create fake variable for age with 3 levels
set.seed(1234567)
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
Then I run my lmer model
library(lme4)
library(splines)
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) + age + Days:age + (Days | Subject), sleep)
Finally, I create the data and matrix needed to obtain predicted means
#new data frame for predicted means
d <- c(0:9) # make a vector of days = 0 to 9 to obtain predictions for each day
newdat <- as.data.frame(cbind(Days=d, age=rep(c(1:3),length(d))))
newdat$Days <- as.numeric(as.character(newdat$Days))
newdat$age <- as.factor(newdat$age)
# create a matrix
mm<-model.matrix(~Days + ns(Days, df=4) + age + Days:age, newdat)
newdat$pred<-mm%*%fixef(m4)
It's at this point that I get the error:
Error in mm %*% fixef(m4) : non-conformable arguments
I can use predict to get the means
newdat$pred <- predict(m4, newdata=newdat, re.form=NA)
which works fine, but I want to be able to calculate a confidence interval, so I need a conformable matrix.
I read somewhere that the problem may be that lmer creates aliases (I can't find that post). This comment was made with regards to not being able to use effect() for a similar task. I couldn't quite understand how to overcome this problem. Moreover, I recall that post was a little old and hoped the alias problem may no longer be relevant.
If anyone has a suggestion for what I may be doing wrong, I'd appreciate the feedback. Thanks.
There are a couple of things here.
You need to drop columns to make your model matrix commensurate with the fixed-effect vector that was actually fitted (i.e., commensurate with the model matrix that was actually used for fitting, after dropping collinear columns).
For additional confusion, you happened to sample only ages 2 and 3 (out of a possible {1,2,3}).
I've cleaned up the code a little bit ...
library("lme4")
library("splines")
sleep <- sleepstudy #get the sleep data
set.seed(1234567)
## next line happens to sample only 2 and 3 ...
sleep$age <- as.factor(sample(1:3,length(sleep),rep=TRUE))
length(levels(sleep$age)) ## 2
Fit model:
m4 <- lmer(Reaction ~ Days + ns(Days, df=4) +
age + Days:age + (Days | Subject), sleep)
## message; fixed-effect model matrix is
## rank deficient so dropping 1 column / coefficient
Check fixed effects:
f1 <- fixef(m4)
length(f1) ## 7
f2 <- fixef(m4,add.dropped=TRUE)
length(f2) ## 8
We could use this extended version of the fixed effects (which has an NA value in it), but this would just mess us up by propagating NA values through the computation ...
Check model matrix:
X <- getME(m4,"X")
ncol(X) ## 7
(which.dropped <- attr(getME(m4,"X"),"col.dropped"))
## ns(Days, df = 4)4
## 6
New data frame for predicted means
d <- 0:9
## best to use data.frame() directly, avoid cbind()
## generate age based on *actual* levels in data
newdat <- data.frame(Days=d,
age=factor(rep(levels(sleep$age),length(d))))
Create a matrix:
mm <- model.matrix(formula(m4,fixed.only=TRUE)[-2], newdat)
mm <- mm[,-which.dropped] ## drop redundant columns
## newdat$pred <- mm%*%fixef(m4) ## works now
Added by sianagh: Code to obtain confidence intervals and plot the data:
predFun <- function(x) predict(x,newdata=newdat,re.form=NA)
newdat$pred <- predFun(m4)
bb <- bootMer(m4,
FUN=predFun,
nsim=200)
## nb. this produces an error message on its first run,
## but not on subsequent runs (using the development version of lme4)
bb_ci <- as.data.frame(t(apply(bb$t,2,quantile,c(0.025,0.975))))
names(bb_ci) <- c("lwr","upr")
newdat <- cbind(newdat,bb_ci)
Plot:
plot(Reaction~Days,sleep)
with(newdat,
matlines(Days,cbind(pred,lwr,upr),
col=c("red","green","green"),
lty=2,
lwd=c(3,2,2)))
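As an alternative to the bootstrap, once the model matrix is conformable it can be used directly for approximate Wald intervals on the population-level predictions. The sketch below uses a simpler model than m4 purely to keep the example self-contained, and it accounts only for uncertainty in the fixed effects, not in the random-effects parameters:

```r
library(lme4)

# Simpler model than m4, used only to illustrate the calculation
m <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

newdat <- data.frame(Days = 0:9)
mm <- model.matrix(~ Days, newdat)       # columns match names(fixef(m))

pred <- as.numeric(mm %*% fixef(m))      # population-level predictions
V <- as.matrix(vcov(m))                  # covariance of the fixed effects
se <- sqrt(diag(mm %*% V %*% t(mm)))     # standard error of each prediction

newdat$pred <- pred
newdat$lwr <- pred - 1.96 * se           # approximate 95% Wald interval
newdat$upr <- pred + 1.96 * se
newdat
```

For a model with dropped columns such as m4, mm would first need the same columns removed as shown earlier in this answer.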
The error is caused by the drift component. If you put
allowdrift=FALSE
into your auto.arima call, it will be fixed.

How to click links onto the next page using RCurl?

I am trying to scrape this table from this website using RCurl. I am able to do this and put it into a nice dataframe using the code:
clinVar <- getURL("http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1")
docForm2 <- htmlTreeParse(clinVar,useInternalNodes = T)
xp_expr = "//table[@class= 'jig-ncbigrid docsum_table\']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq", "Phenotype","Clinical significance","Status", "Chr","Location")
However, I can only extract the data on the first page, and the table spans multiple pages. How do you access data on the next page? I have looked at the HTML code for the website and the region that the "Next" button exists in is here (I believe!):
<a name="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page">Next ></a>
I would like to know how to access this link using getURL, postForm etc. I think I should be doing something like this, to get data from the second page but it's still just giving me the first page:
url <- "http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1"
clinVar <- postForm(url,
"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.cPage" ="2")
docForm2 <- htmlTreeParse(clinVar,useInternalNodes = T)
xp_expr = "//table[@class= 'jig-ncbigrid docsum_table\']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq","Phenotype","Clinical significance","Status", "Chr","Location")
Thanks to anyone who can help.
I would use E-utilities to access data at NCBI instead.
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=brca1"
readLines(url)
[1] "<?xml version=\"1.0\" ?>"
[2] "<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD eSearchResult, 11 May 2002//EN\" \"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd\">"
[3] "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"
Pass the QueryKey and WebEnv to esummary and get the XML summary (this changes with each esearch, so copy and paste the new keys into the url below)
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&query_key=1&WebEnv=NCID_1_36649974_130.14.18.34_9001_1386348760_356908530"
brca1 <- xmlParse(url2)
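Rather than copying the keys by hand, they can be pulled out of the esearch response with the same XML functions. The sketch below parses an abbreviated stand-in for the response; the WebEnv value is made up, and in practice the text would come from the esearch URL:

```r
library(XML)

# Abbreviated stand-in for an esearch response (WebEnv value is made up)
esearch_xml <- '<?xml version="1.0" ?>
<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart>
<QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>'

doc <- xmlParse(esearch_xml, asText = TRUE)
query_key <- xpathSApply(doc, "//QueryKey", xmlValue)
web_env <- xpathSApply(doc, "//WebEnv", xmlValue)

# Build the esummary URL from the extracted keys
url2 <- paste0(
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar",
  "&query_key=", query_key, "&WebEnv=", web_env
)
url2
```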
Next, view a single record and then extract the fields you need. You may need to loop through the set if there are 0 to many values assigned to a tag. Others like clinical significance description always have 1 value.
getNodeSet(brca1, "//DocumentSummary")[[1]]
table(xpathSApply(brca1, "//clinical_significance/description", xmlValue) )
Benign conflicting data from submitters not provided other
129 22 6 1
Pathogenic probably not pathogenic probably pathogenic risk factor
508 68 19 43
Uncertain significance
284
Also, there are many packages with E-utilities on github and BioC (rentrez, reutils, genomes and others). Using the genomes package on BioC, this simplifies to
brca1 <- esummary( esearch("brca1", db="clinvar"), parse=FALSE )
Using the e-utilities feature on the NCBI database, see http://www.ncbi.nlm.nih.gov/books/NBK25500/ for more details.
## use eSearch feature in eUtilities to search NCBI for ids corresponding to each row of data.
## note: to see all ids, not just the first 20 (the default RetMax), set retmax to a high number
## to get query id and web env info, set usehistory=y
library(RCurl)
library(XML)
baseSearch <- ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=") ## eSearch
db <- "clinvar" ## database to query
gene <- "BRCA1" ## gene of interest
query <- paste('[gene]+AND+"','clinsig pathogenic"','[Properties]+AND+"','single nucleotide variant"','[Type of variation]&usehistory=y&retmax=1110',sep="") ## query, see below for details
baseFetch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=" ## base fetch
searchURL <- paste(baseSearch,db, "&term=",gene,query,sep="")
getSearch <- getURL(searchURL)
searchHTML <- htmlTreeParse(getSearch, useInternalNodes =T) ## parse the response already fetched above
nodes <- getNodeSet(searchHTML,"//querykey") ## this name "querykey" was extracted from the HTML source code for this page
querykey <- xmlToDataFrame(nodes)
nodes <- getNodeSet(searchHTML,"//webenv") ## this name "webenv" was extracted from the HTML source code for this page
webenv <- xmlToDataFrame(nodes)
fetchURL <- paste(baseFetch,db,"&query_key=",querykey,"&WebEnv=",webenv[[1]],"&rettype=docsum",sep="")
getFetch <- getURL(fetchURL)
fetchHTML <- htmlTreeParse(getFetch, useInternalNodes =T)
nodes <- getNodeSet(fetchHTML, "//position")
extractedDataAll <- xmlToDataFrame(nodes)
colnames(extractedDataAll) <- c("pathogenicSNPs")
print(extractedDataAll)
Please note: I found the query information by going to http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1, selecting my filters (pathogenic, etc.), and then clicking the advanced button. The most recent filters applied should come up in the main box; I used this for the query.
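If one very large retmax is not practical, the results can also be paged through with the esearch retstart parameter, which is the E-utilities counterpart of clicking "Next". A minimal sketch that only builds the URLs; page_urls is a hypothetical helper, and the total of 1080 is the Count value from the esearch output shown in the earlier answer:

```r
# Hypothetical helper: build one esearch URL per page of results
page_urls <- function(term, total, page_size = 20, db = "clinvar") {
  starts <- seq(0, total - 1, by = page_size)
  paste0(
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=", db,
    "&term=", term,
    "&retstart=", format(starts, scientific = FALSE, trim = TRUE),
    "&retmax=", page_size
  )
}

urls <- page_urls("brca1", total = 1080)
length(urls)
urls[2]
```

Each URL can then be fetched and parsed in turn, and the resulting id lists combined.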
ClinVar now offers XML download of the whole database so webscraping is not necessary.