I need some help extracting affiliation information from PubMed search results in R. I have already successfully extracted affiliation information from the XML for a single PubMed ID, but now I have a search string covering multiple terms, and I need to extract the affiliation information from its results, with the aim of then creating a data frame with columns such as PMID, author, country, state, etc.
This is my code so far:
my_query <- "(PubMed Search String)"  # placeholder for the actual search string
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")
The PubMed search string is very long, which is why I haven't included it here. The main aim is to produce a data frame from this search string: a table clearly showing the affiliations and other general information for the PubMed articles.
Any help would be greatly appreciated!
Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html
library(pubmedR)
library(purrr)
library(tidyr)
my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic]))'
my_request <- pmApiRequest(query = my_query,
limit = 5)
You can use the built-in function my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.
You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.
auth <- pluck(my_request, "data") %>% {
tibble(
pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
)
}
All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).
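For example, here is a minimal sketch (assuming the nested structure described above; dplyr is loaded in addition to purrr and tidyr) that pulls each author's name and first affiliation out of author_list, one row per author:
library(dplyr)
auth_long <- auth %>%
  mutate(
    lastname  = map(author_list, ~ map_chr(.x, pluck, "LastName", 1, .default = NA_character_)),
    firstname = map(author_list, ~ map_chr(.x, pluck, "ForeName", 1, .default = NA_character_)),
    affil     = map(author_list, ~ map_chr(.x, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_))
  ) %>%
  select(pmid, lastname, firstname, affil) %>%
  unnest(c(lastname, firstname, affil))
Element names can differ for collective authors, hence the .default = NA_character_ fallback.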
=================================================
EDIT based on comments:
First construct your request URLs. Make sure you replace MYEMAIL@MYDOMAIN.COM in the &email= parameter with your own email address:
library(httr)
library(xml2)
mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500",
"32894970", "32883566", "32880294", "32873658", "32856805",
"32856803", "32820143", "32810084", "32809963", "32798472")
my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
mypmids,
"&retmode=xml&email=MYEMAIL#MYDOMAIN.COM")
I like to wrap my API requests in safely() to catch any errors. Then use map() to loop through the my_query vector. Note that we Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. You can probably cut this down to a second or even less; check the API documentation.
get_safely <- safely(GET)
my_req <- map(my_query, function(z) {
print(z)
req <- get_safely(url = z)
Sys.sleep(5)
return(req)
})
Next we parse each response by passing content() into read_xml(). Note that we pass z$result because get_safely() returns a list with result and error elements:
my_resp <- map(my_req, function(z) {
read_xml(content(z$result,
as = "text",
encoding = "UTF-8"))
})
This can probably be cleaned up some, but it works. Coerce the AuthorList to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but we are only plucking the first one.
my_pm_list <- map(my_resp, function (z) {
my_xml <- xml_child(xml_child(z, 1), 1)
pmid <- xml_text(xml_find_first(my_xml, "//PMID"))
authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
return(list(pmid, authinfo))
})
myauthinfo <- map(my_pm_list, function(z) {
auth <- z[[2]][[1]]
})
mytibble <- myauthinfo %>% {
tibble(
lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
)
}
my_unnested_tibble <- mytibble %>%
bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
unnest(c(lastname, firstname, affil))
I was provided with a list of identifiers (in this case the identifier is called an NPI). These identifiers can be copied and pasted into this website (https://npiregistry.cms.hhs.gov/registry/?). I want to return the NPI number, the name of the physician, the address, the phone number, and the specialty.
I have over 3,000 identifiers so a copy and paste is not efficient and not easily repeatable for future use.
If possible, I would like to create a list of URLs, pass them into a function, and receive a dataframe that provides me with the variables mentioned above (NPI, NAME, ADDRESS, PHONE, SPECIALTY).
I was able to write a function that produces the URLs needed:
Here are some NPI numbers for reference: 1417024746, 1386790517, 1518101096, 1255500625.
This is my code for reading in the file that contains my NPIs:
npiList <- c("1417024746", "1386790517", "1518101096", "1255500625")
npiList <- as.list(npiList)
npiList <- unlist(npiList, use.names = FALSE)
This is the function to return the list of URLs:
npiaddress <- function(x){
url <- paste("https://npiregistry.cms.hhs.gov/registry/search-results-
table?number=",x,"&addressType=ANY", sep = "")
return(url)
}
I saved the list to a variable and perhaps this is my downfall:
npi_urls <- npiaddress(npiList)
From here I wrote a function that can accept a single URL, retrieves the data I want and turns it into a dataframe. My issue is that I cannot pass multiple URLs:
library(rvest)
npiLookup <- function (x){
url <- x
webpage <- read_html(url)
npi_html <- html_nodes(webpage, "td")
npi <- html_text(npi_html)
npi[4] <- gsub("\r?\n|\r", " ", npi[4])
npi[4] <- gsub("\r?\t|\r", " ", npi[4])
npiFinal <- npi[c(1:2,4:6)]
npiFinal <- as.data.frame(npiFinal)
npiFinal <- t(npiFinal)
npiFinal <- as.data.frame(npiFinal)
names(npiFinal) <- c("NPI", "NAME", "ADDRESS", "PHONE", "SPECIALTY")
return(npiFinal)
}
For example:
If I wanted to get a dataframe for the following identifier (1417024746), I can run this and it works:
x <- npiLookup("https://npiregistry.cms.hhs.gov/registry/search-results-table?number=1417024746&addressType=ANY")
View(x)
My output for the example returns the NPI, NAME, ADDRESS, PHONE, SPECIALTY as desired, but again, I need to do this for several thousand NPI identifiers. I feel like I need a loop within npiLookup. I've also tried to put npi_urls into the npiLookup function but it does not work.
Thank you for any help and for taking the time to read.
You're most of the way there. The final step uses this useful R idiom:
do.call(rbind,lapply(npiList,function(npi) {url=npiaddress(npi); npiLookup(url)}))
do.call is a base R function that applies a function (in this case rbind) to the list produced by lapply. That list is the result of running your npiLookup function on the url produced by your npiaddress for each element of npiList.
A few further comments for future reference should anyone else come upon this question: (1) I don't know why you're doing the as.list, unlist sequence at the beginning; it's redundant and probably unnecessary. (2) The NPI registry provides a programming interface (API) that avoids the need to scrape data from the HTML pages; this might be more robust in the long run. (3) The NPI registry provides the entire dataset as a downloadable file; this might have been an easier way to go.
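For reference, option (2) might look something like this minimal sketch (the exact endpoint, version parameter, and the field names inside the JSON response are assumptions to verify against the NPPES API documentation):
library(httr)
library(jsonlite)
get_npi_record <- function(npi) {
  # Query the NPPES registry API for a single NPI number
  resp <- GET("https://npiregistry.cms.hhs.gov/api/",
              query = list(version = "2.1", number = npi))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
rec <- get_npi_record("1417024746")
# Inspect str(rec$results) to locate the name, address, phone and taxonomy fields you need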
I would like to make a loop with html_nodes to capture the value of some nodes (the nodes themselves, not their text). I have these values:
library(rvest)
country <- c("Canada", "US", "Japan", "China")
With those values ("Canada", "US", ...), I've written a loop which creates a URL by pasting each value onto "https://en.wikipedia.org/wiki/". For each new URL it applies read_html(i) and a sequence of steps to finally capture a node with html_nodes('a.page-link') (yes, a node, not text) and save that html_nodes(...) result as.character in a data.frame (or it could be a list).
dff<- NULL
for ( i in country ) {
url<-paste0("https://en.wikipedia.org/wiki/",i)
page<- read_html(url)
b <- page%>%
html_nodes ('h2.flow-title') %>%
html_nodes ('a.page-link') %>%
as.character()
dff<- data.frame(b)
}
The problem is this code only saves the data from the last country. That is, it runs the first country and obtains (and saves) the html_nodes result, but when it runs the next country the first result is erased and replaced by the new one, and so on, so the final result is just the data from the last country.
I would be grateful with your help!
As the comment mentioned, this line: dff <- data.frame(b) is overwriting dff on each loop iteration. One solution is to create an empty list and append the data to the list.
In this example the list items are named for the country queried.
library(rvest)
country <- c("Canada", "US", "Japan", "China")
#initialize the empty list
dff<- list()
for ( i in country ) {
url<-paste0("https://en.wikipedia.org/wiki/",i)
page<- read_html(url)
b <- page%>%
html_nodes ('h2.flow-title') %>%
html_nodes ('a.page-link') %>%
as.character()
#append new data onto the list
dff[[i]]<- data.frame(b)
}
To access the data, one can use dff$Canada, or lapply to process the entire list.
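If you then want everything in a single data frame, one small sketch (assuming every dff[[i]] has the same columns and dplyr is available) is:
combined <- dplyr::bind_rows(dff, .id = "country")  # keeps the country name as a column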
Note: I ran your example and it returned no results, so better double-check the node selectors.
Good morning.
I want to use the following REST endpoint: https://rest.ensembl.org/documentation/info/sequence_id_post
I have the vector object (ids) in R:
> ids
[1] "NM_007294.3:c.932_933insT" "NM_007294.3:c.1883C>T" "NM_007294.3:c.2183A>C"
[4] "NM_007294.3:c.2321C>T" "NM_007294.3:c.4585G>A" "NM_007294.3:c.4681C>A"
I have to put this vector (ids), which has more than 200 entries, into the body argument (below), following the example code below, so that it works:
Code:
library(httr)
library(jsonlite)
library(xml2)
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = '{ "hgvs_notations" : ["NM_007294.3:c.932_933insT", "NM_007294.3:c.1883C>T"] }')
stop_for_status(r)
head(fromJSON(toJSON(content(r))))
I know it's a json format, but when I convert my variable ids to json it's not in the correct format.
Do you have any suggestions?
Thanks for any help.
Leandro
I think that NM_007294.3:c.2321C>T is not a valid query for the /sequence/id REST endpoint. It contains a sequence id (NM_007294.3) and a variant (c.2321C>T), and if the server took this literally, you would be asking it for the letter T, since this call returns sequences.
A valid query would contain only sequence ids, and you can send it like this (provided you have your ids in a vector):
r <- POST(paste(server, ext, sep = ""), content_type("application/json"), accept("application/json"), body = paste0('{ "ids" : ', jsonlite::toJSON(ids), ' }'))
Depending on the downstream scenario, making your ids unique might help/speed things up.
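For example, a small sketch (assuming all your ids look like "NM_007294.3:c.932_933insT") that strips the variant part and de-duplicates before building the body:
seq_ids <- unique(sub(":.*$", "", ids))  # keep only the sequence id before the colon
body_json <- paste0('{ "ids" : ', jsonlite::toJSON(seq_ids), ' }')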
I am attempting to use the R function readHTMLTable to gather data from the online database at www.racingpost.com. I have a CSV file with 30,000 unique ids which can be used to identify individual horses. Unfortunately a small number of these ids are leading readHTMLTable to return the error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
My question is - is it possible to set up a wrapper function that will skip the ids which return NULL values but then continue reading the remaining HTML tables? The reading stops at each NULL value.
What I have tried so far is this:
ids = c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134)
which are all valid horse ids bar the 0011 which will return a NULL value. Then:
scrapescrape <- function(x) {
link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=",x)
if (!is.null(readHTMLTable(link, which=2))) {
Frame1 <- readHTMLTable(link, which=2)
}
}
total_data = c(0)
for (id in ids) {
total_data = rbind(total_data, scrapescrape(id))
}
However, I think the error is returned at the if statement which means the function stops when it reaches the first NULL value. Any help would be greatly appreciated - many thanks.
You could analyse the HTML first (inspect the page you get, and find a way to recognise a false result), before reading the HTML table.
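A possible sketch of that first idea (what exactly marks a bad id on the page is an assumption here, so this simply checks whether the page contains a second table before trying to read it):
library(XML)
has_table <- function(link) {
  # Parse the page once and count the tables; a bad id is assumed to have fewer than two
  doc <- try(htmlParse(link), silent = TRUE)
  if (inherits(doc, "try-error")) return(FALSE)
  length(getNodeSet(doc, "//table")) >= 2
}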
But you can also make sure the function returns nothing (NA) when an error is thrown, like so:
library(XML)
scrapescrape <- function(x) {
link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=",x)
tryCatch(readHTMLTable(link, which=2), error=function(e){NA})
}
ids <- c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134)
lst <- lapply(ids, scrapescrape)
str(lst)
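To stack the successful results into one data frame, a short sketch (assuming the tables share the same columns; failed lookups are NA rather than data frames):
ok <- vapply(lst, is.data.frame, logical(1))
total_data <- do.call(rbind, lst[ok])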
Using rvest you can do:
require(rvest)
require(purrr)
paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=", ids) %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[2]],
NULL)) %>%
discard(is.null)
The last line discards all "failed" attempts. If you want to keep them, just drop the last line.
I'm trying to write a for loop that will take zip codes, make an API call to a database of Congressional information, and then parse out only the parties of the congressmen representing that zip code.
The issue is that some of the zip codes have more than one congressman and others have none at all (an error on the part of the database, I think). That means I need to loop through the count returned by the original pull until there are no more representatives.
The issue is that the number of congressmen representing each zip code is different. Thus, I'd like to be able to write new variable names into my dataframe for each new congressman. That is, if there are 2 congressmen, I'd like to write new columns named "party.1" and "party.2", etc.
I have this code so far and I feel that I'm close, but I'm really stuck on what to do next. Thank you all for your help!
EDIT: I found this way to be easier, but I'm still not getting the results I'm looking for.
library(rjson)
library(RCurl)
zips <- c("10001","92037","90801", "94011")
test <- matrix(nrow=4,ncol=7)
temp <- NULL
tst <- NULL
for (i in 1:length(zips)) {
for (n in length(temp$count)) {
temp <- fromJSON(getURL(paste('https://congress.api.sunlightfoundation.com/legislators/locate?zip=',
zips[i], '&apikey=', 'INSERT YOUR API KEY', sep = ""), .opts = list(ssl.verifypeer = FALSE)))
tst <- try(temp$results[[n]]$party, silent=T)
if(is(tst,"try-error"))
test[i,n] <- NA
else
test[i,n] <- (temp$results[[n]]$party)
}
}
install.packages("rsunlight")
library("rsunlight")
zips <- c("10001","92037","90801", "94011")
out <- lapply(zips, function(z) cg_legislators(zip = z))
# results for some only
sapply(out, "[[", "count")
# peek at results for one zip code
head(out[[1]]$results[,1:4])
bioguide_id birthday chamber contact_form
1 S000148 1950-11-23 senate http://www.schumer.senate.gov/Contact/contact_chuck.cfm
2 N000002 1947-06-13 house https://jerroldnadler.house.gov/forms/writeyourrep/default.aspx
3 M000087 1946-02-19 house https://maloney.house.gov/contact-me/email-me
4 G000555 1966-12-09 senate http://www.gillibrand.senate.gov/contact/
You can change as needed within a lapply or for loop to add columns, etc.
Pulling out party could be as simple as lapply(zips, function(z) cg_legislators(zip = z)$results$party).
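For instance, a hedged sketch (assuming each call's $results is a data frame with a party column, as above) that keeps the zip code alongside each legislator's party:
party_df <- do.call(rbind, lapply(zips, function(z) {
  res <- cg_legislators(zip = z)$results
  if (is.null(res) || nrow(res) == 0) return(NULL)
  data.frame(zip = z, party = res$party, stringsAsFactors = FALSE)
}))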