I would like to parse, in R, all chemical properties of a given compound as given in PubChem, using the JSON (or XML) export facility.
Example: ALPHA-IONONE, pubchem compound ID 5282108
https://pubchem.ncbi.nlm.nih.gov/compound/5282108
library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
or
library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
will get me a tree of nested lists, but how do I go from this rather complicated list of nested lists to a nice dataframe or list of dataframes?
In this case, what I am after is everything under
3.1 Computed Descriptors
3.2 Other Identifiers
3.3 Synonyms
4.1 Computed Properties
in a single row of a dataframe, with each element in a separate named column and multiple items per element (e.g. multiple synonyms) pasted together with "|" as a delimiter. E.g. in this case something like
pubchemid IUPAC_Name InChI InChI_Key Canonical SMILES Isomeric SMILES CAS EC Number Wikipedia MeSH Synonyms Depositor-Supplied Synonyms Molecular_Weight Molecular_Formula XLogP3 Hydrogen_Bond_Donor_Count ...
5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....
Fields with multiple items, such as Depositor-Supplied Synonyms, could be pasted together with a "|", e.g. the value could be ALPHA-IONONE|Iraldeine|...
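For reference, the collapsing itself is just paste() with a collapse argument, e.g. for a made-up subset of the synonyms:
paste(c("ALPHA-IONONE", "Iraldeine"), collapse="|")
# [1] "ALPHA-IONONE|Iraldeine"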
Second, I would also like to import section
4.2.2 Kovats Retention Index
as a dataframe
pubchemid column_class kovats_ri
5282108 Standard non-polar 1413
5282108 Standard non-polar 1417
...
5282108 Semi-standard non-polar 1427
...
(section 4.3.1 GC-MS would have been nice too, but since it only displays the 3 top peaks this is a little useless right now, so I'll skip that)
Anybody any idea how to achieve this in an elegant way?
PS Note that not all these fields will necessarily exist for any given query.
2D structure and some properties can also be obtained from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display
and 3D structure from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display
Data can also be exported as XML, using
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display
if that would be any easier
Note: I also tried the R package rpubchem, but that one only seems to import a small amount of the available info:
library("rpubchem")
get.cid(5282108)
CID IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one C13H20O 192.297300 0 3 0 1 14 17 5282108
My proposal works on the XML files, because (thanks to XPath) I find them more convenient to traverse and select nodes from.
Please note that this is neither fast (it took a few seconds while testing) nor optimal (I parse each file twice - once for names and the like and once for the Kovats Retention Index). But I guess that you will want to parse some set of files once and then go ahead with your real business, and premature optimization is the root of all evil.
I have put the main tasks into separate functions. If you want to get data for one specific pubchem record, they are ready to use. But if you want to get data from a few pubchem records at once, you can define a vector of pointers to the data and use the examples at the bottom to merge the results together. In my case, the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them (remember that each site will be requested twice, and if there is a larger number of records, you probably want to handle network failures somehow).
The compound you have linked to has multiple entries for "EC Number". They differ by ReferenceNumber, but not by Name. I wasn't sure why that is or what I should do with it (your sample output contains only one entry for EC Number), so I left this to R. R added suffixes to the duplicated names and created EC.Number.1, EC.Number.2 etc. These suffixes do not match the ReferenceNumber in the file, and the same column in the master data frame will probably refer to different ReferenceNumbers for different compounds.
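To illustrate what R does with the duplicated names (the values here are made up):
data.frame(setNames(as.list(c("111-11-1", "222-22-2", "333-33-3")),
                    rep("EC Number", 3)))
#   EC.Number EC.Number.1 EC.Number.2
# 1  111-11-1    222-22-2    333-33-3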
It seems that pubchem uses the following format for tags: <type>Value[List]. In a few places I have hardcoded StringValue, but maybe some compounds have different types in the same fields. I generally haven't handled lists, except where it was requested, so further modifications might be needed as more data is thrown at this code.
If you have any questions, please post them in the comments; I am not sure which parts of the code need explaining.
library("xml2")
library("data.table")
compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  # All Information nodes under the four requested TOC sections
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)
  properties <- sapply(information, function(x) {
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    # Lists (e.g. synonyms) are collapsed with "|"; single values are taken as-is
    value <- ifelse(length(xml_find_all(x, "./d1:StringValueList", ns)) > 0,
                    paste(sapply(
                      xml_find_all(x, "./d1:StringValueList", ns),
                      xml_text, trim=TRUE), sep="", collapse="|"),
                    xml_text(
                      xml_find_first(x, "./*[contains(name(),'Value')]", ns),
                      trim=TRUE)
    )
    names(value) <- name
    return(value)
  })
  rm(compound, information)
  properties <- as.list(properties)
  # The pubchem id is taken from the file path / URL
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  return(data.frame(properties))
}
compound.retention.index <- function(file=NULL) {
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  # All Information nodes under the 'Kovats Retention Index' heading
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)
  # One data frame per column class; each NumValue becomes a row
  indexes <- lapply(information, function(x) {
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    values <- as.numeric(sapply(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns),
      xml_text))
    data.frame(pubchemid=pubchemid,
               column_class=name,
               kovats_ri=values)
  })
  return( do.call("rbind", indexes) )
}
# Vector of pointers to the data: local files here, but URLs work too
compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")
# Master data frame of attributes; fill=TRUE handles fields missing for some compounds
cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill=TRUE
)
# Long data frame of Kovats retention indices
rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))
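If you start from CIDs rather than files already on disk, one way around requesting each site twice is to download the XML once and point compounds at the local copies - a rough sketch, using the PUG View URL pattern from the question:
cids <- c(5282108, 5282148, 91754124)
compounds <- paste0("./", cids, ".xml")
for (i in seq_along(cids)) {
  if (!file.exists(compounds[i])) {
    # one download per CID, cached on disk for both parsing passes
    download.file(
      paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
             cids[i], "/XML/?response_type=display"),
      destfile = compounds[i], mode = "wb")
  }
}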
Related
I am trying to scrape the name, address and longitude/latitude coordinates for each name on a website (e.g. www.mywebsite.com). I used the following code to get the address and name, based on this SO post
library(tidyverse)
library(rvest)
library(httr)
library(XML)
# Define function to scrape 1 page
get_info <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("mywebsite.com", page_n, "?extension") %>% read_html
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         adress = page %>%
           html_elements(".marker") %>%
           html_text2(),
         page = page_n)
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Since I did not know how to modify the above code to extract the coordinates, I wrote a separate script to scrape them:
# Recognize pattern in websites
part1 = "www.mywebsite.com"
part2 = c(0:55)
part3 = "?extension"
temp = data.frame(part1, part2, part3)
# Create list of websites
temp$all_websites = paste0(temp$part1, temp$part2, temp$part3)
# Scrape
df_2 <- list()
for (i in 1:10) {
  tryCatch({
    url_i <- temp$all_websites[i]
    page_i <- read_html(url_i)
    b_i <- page_i %>% html_nodes("head")
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
    df_2[[i]] <- listanswer_i
    print(listanswer_i)
  }, error = function(e){})
}
# Extract long/lat from results
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 = data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
In the end, scraping the first 10 pages for name/address resulted in 90 entries, but scraping the same 10 pages for the longitude/latitude resulted in 96 entries:
dim(df_1)
[1] 90
dim(df_2)
[1] 96 3
Can someone please help me understand why this is happening and what I can do to fix it?
In the end, I would like to make a final table (using df_1 and df_2) that looks something like this:
id name address long lat
1 1 name1 address1 long1 lat1
2 2 name2 address2 long2 lat2
3 3 name3 address3 long3 lat3
Thanks!
Note: I understand that it's possible that some names might be missing their latitude/longitude, and it might not be possible to have the dimensions of "df_1" match the dimensions of "df_2". If this is the case, would it somehow be possible to find out which names are missing their latitude/longitude (e.g. replace the latitude/longitude entries with NA for those cases)? For example - suppose the latitude/longitude was not available for "name3":
id name address long lat
1 1 name1 address1 long1 lat1
2 2 name2 address2 long2 lat2
3 3 name3 address3 NA NA
The Problem
The problem is that your second code snippet is not filtering out strings that contain "LatLng" but do not provide coordinates.
After your second code snippet finishes scraping the pages, you do the following:
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
If you look at the output of this with print(lat_long), you would see a bunch of rows with coordinates. In fact, you'd see exactly 90 such rows because that's how many providers appeared on all those pages. However, you'd also see rows with the string "\t\t\t\tvar bounds = new google.maps.LatLngBounds();". If you go back to the raw HTML you grabbed, you'd see this appears occasionally. Accordingly, you need to remove these rows.
I thought that perhaps you accomplished this with the remaining code, but you never actually remove them. For example, the below code just produces an object filled with NA values. I don't think this does what you want:
as.numeric(gsub("([0-9]+).*$", "\\1", lat_long))
Additionally, the below retains those values as well:
data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
The Solution
You need to drop elements without coordinates. You'll notice that those elements all contain the substring "LatLngBounds();", so you can just filter them out once they're in a data.frame like below, or using regex.
df_2 %>% filter(X1 != "LatLngBounds();")
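For the "using regex" route, the same filtering can be done on the raw strings before the data frame is built; a one-line sketch:
lat_long <- grep("LatLngBounds", lat_long, value = TRUE, invert = TRUE)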
Note that this will actually produce 86 rows instead of 90. So, now we're actually short 4 rows. This is because you are not actually collecting all of the GPS coordinates for every provider on the page. You can know this because every provider has an address in df_1, and the coordinates come from simply passing those addresses to the Maps API.
Why aren't you getting all of the coordinates? My guess is that there are two reasons. First, you are scraping coordinates based on the marker substring. This indicates marker pins on the map. Since the number of pins on the map need not equal the number of providers on the page, you will miss some providers. The less likely issue may have to do with the Google Maps API. If you visit the URLs you create to scrape from (for example), you'll see in the bottom left that the Google Maps widget contains the error "This page didn't load Google Maps correctly. See the JavaScript console for technical details". If you look at the JS console, you'll see that an invalid Google Maps API key was provided. This seems like a likely issue since (a) there is one "LatLngBounds" row per page you are scraping and (b) the row after each of those rows contains coordinates that are not necessarily anywhere near the providers (mine initializes on the U.S. West Coast while the providers are in Canada). I don't know if this has any consequence, but it would explain things if the marker issue isn't the driver.
However, all of this is mostly irrelevant since you don't even need to scrape the coordinates in the first place. You have a list of addresses: you can geocode them yourself! There are different ways of doing this, but you can replicate what the site is doing by simply passing them to the Google Maps API. For step-by-step instructions on how to do this, see here.
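A minimal sketch of that geocoding route, assuming you have your own Google Maps API key registered (ggmap is just one of several packages that can do this):
library(ggmap)
register_google(key = "YOUR_API_KEY")   # assumes you have a Maps API key of your own
coords <- geocode(df_1$adress)          # one lon/lat pair per address scraped into df_1
df_final <- cbind(df_1, coords)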
Identifying the Problem
To provide a better idea of how to approach similar problems in the future, I'll show how I worked through this. One way to approach issues like this is to start by ruling out possible explanations.
Why the problem isn't "missing coordinates"
If the issue was that names are missing coordinates, we would expect nrow(df_1) > nrow(df_2). However, you reported the opposite: nrow(df_2) > nrow(df_1).
Why the problem isn't the first code snippet
Since each page contains 9 providers (at least until the last page) and you are scraping 10 pages, we expect to return 9*10 = 90 elements. As you noted, the first code snippet returns an object with 90 rows while the second code snippet returns an object with 96 rows. The second code snippet must be the issue.
Why the problem isn't the pages
Looking at your code, I noticed that you're scraping different pages. Your code to produce df_1 iterates over the values of page_n in the interval 1:10. In contrast, your code to produce df_2 iterates over pages 0:9. This is because the latter code extracts the elements of all_websites at indices 1:10, which correspond to part2 values 0:9, since part2 is simply the vector 0:55. Since page_n == 0 returns the same page as page_n == 1, your first code is scraping pages 1:10 and your latter code is scraping pages c(1, 1:9). This means that the values contained in df_1 and df_2 will differ.
However, this cannot explain the discrepancy in the dimensionality of the two objects since they would still be expected to return 90 rows!
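You can check this directly: the first ten elements of temp$all_websites are built from the first ten values of part2, which are 0 through 9:
c(0:55)[1:10]
# [1] 0 1 2 3 4 5 6 7 8 9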
I'm attempting to create data frames by attaching URLs to a scraped HTML table, and then writing these to individual csv files. The data concern the passage of Bills through their respective stages in both the House of Commons and Lords. I've written a function (see below) which reads the tables, parses the HTML code, scrapes the URLs required, binds the two together, extracts the rows concerned with the House of Lords, and then writes the csv files. This function is then run across two lists (one of links to the Bill stage pages and another of simplified file names).
library(XML)
lords_tables <- function (x, y) {
  tables <- as.data.frame(readHTMLTable(x))
  sitePage <- htmlParse(x)  # This parses web code
  hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
                       xmlGetAttr, 'href')  ## First href child of the a nodes
  table_bind <- cbind(tables, hrefs)
  row_no <- grep(".+: House of Lords|Royal Assent",
                 table_bind$NULL.V2)  # Gives row position of Lords|Royal Assent
  lords_rows <- table_bind[grep(".+: House of Lords|Royal Assent",
                                table_bind$NULL.V2), ]  # Subsets rows containing House of Lords|Royal Assent
  write.csv(lords_rows, file = paste0(y, ".csv"))
}
# x = a list of links to the Bill pages/ y = list of simplified names
mapply(lords_tables, x=link_list, y=gsub_URL)
This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:
browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")
For this example, no debate occurred at the '3rd reading: House of Commons' stage or at 'Royal Assent'. This results in the following error being returned:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 21, 19
In overcoming this error I'd like to have an NA against the missing stage. Has anyone got an idea of how to achieve this? I'm a relative n00b so feel free to suggest a more elegant approach to the whole problem.
Thanks in advance!
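One way the mismatch could be handled inside lords_tables (a sketch only; the row XPath here is an assumption about how the stages table is structured) is to pull the first link per table row and fall back to NA when a row has none, so that hrefs always has one entry per table row:
rows <- getNodeSet(sitePage, "//table//tr[td]")           # one node per data row
hrefs <- sapply(rows, function(row) {
  a <- getNodeSet(row, "./td/descendant::a[1]")           # first link in this row, if any
  if (length(a) > 0) xmlGetAttr(a[[1]], "href") else NA   # NA for stages with no debate link
})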
I would like to read all html tables containing Federer's results from this website: http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity
and store the data in one single data frame. One way I figured out was using the rvest package, but as you may notice, my code only works for a specific number of tournaments. Is there any way I can read all relevant tables with one command? Thank you for your help!
library(rvest)
Url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
x <- vector("list", 4)
for (i in 1:4) {
  results <- Url %>%
    read_html() %>%
    html_nodes(xpath=paste0("//table[@class='mega-table'][", i, "]")) %>%
    html_table()
  results <- results[[1]]
  x[[i]] <- results
}
Your solution above was close to being the final one. One downside of your code was having the read_html statement inside the for loop, which would greatly slow down the processing. In the future, read the page into a variable once and then process the page node by node as necessary.
In this solution, I read the web page into the variable "page" and then extracted the table nodes where class = mega-table. Once there, the html_table command returned a list of the tables of interest. The do.call then loops rbind over that list to bind the tables together.
library(rvest)
url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
page <- read_html(url)                               # read the page once, outside any loop
tablenodes <- html_nodes(page, "table.mega-table")   # every table with class 'mega-table'
tables <- html_table(tablenodes)                     # list of data frames, one per table
#numoftables <- length(tables)
df <- do.call(rbind, tables)                         # bind them into a single data frame
I have many large json files (3 GB each) which I want to load efficiently onto a powerful RServer machine, but loading all records from all files would be redundant and exhausting (50M records multiplied by 40 files). So I thought of using the jsonlite package because I heard it's efficient. The thing is that I do not need all records, but only the subset of records where an embedded element ("source") has an existing field by the name "duration".
This is currently my code:
library(jsonlite)
library(curl)
url <- "https://s3-eu-west-1.amazonaws.com/es-export-data/logstash-2016.02.15.json"
test <- stream_in(url(url))
It's only one extract of many. Now, the jsonlite package has a 'flatten' function to flatten embedded elements into one wide data frame, which I could then filter. However, that seems inefficient; I think pre-filtering while the data is loaded would be much more efficient.
Here is a dput of one record:
> dput(test_data)
"{\"_index\":\"logstash-2016.02.15\",\"_type\":\"productLogs\",\"_id\":\"AVLitaOtp4oNFTVKv9tZ\",\"_score\":0,\"_source\":{\"EntryType\":\"Event\",\"queryType\":\"clientQuery\",\"status\":\"success\",\"cubeName\":\"Hourly Targets Operations by Model\",\"cubeID\":\"aHourlyIAAaTargetsIAAaOperationsIAAabyIAAaModel\",\"startQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"endQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"queryResponeLengthBytes\":0,\"duration\":0,\"concurrentQuery\":14,\"action\":\"finishQueryJaql\",\"#timestamp\":\"2016-02-15T02:14:23.253Z\",\"appTypeName\":\"dataserver\",\"#version\":\"1\",\"host\":\"VDED12270\",\"type\":\"productLogs\",\"tags\":[],\"send_type\":\"PullGen1\",\"sisenseuid\":\"janos.kopecek#regenersis.com\",\"sisenseOwnerid\":\"janos.kopecek#regenersis.com\",\"sisenseVersion\":\" 5.8.1.29\",\"sisenseMonitoringVersion\":\"3.0.0.6\",\"inputType\":\"sqs\",\"token\":\"fTdyoSwaFZTalBlnFIlTsqvvzfKZVGle\",\"logstash_host\":\"vpc_cluster_1\"}}"
Any help appreciated.
You have to add a handler function and specify which elements you need:
stream_in(url(url), handler = function(x) x$"_source"$duration)
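Note that when you supply your own handler, stream_in() no longer returns the combined data (it just calls the handler on each page), so in practice you would collect the filtered pages yourself. A rough sketch, assuming the default data-frame simplification so that "_source" arrives as a nested data frame and records without a duration show up as NA:
library(jsonlite)
pages <- list()
stream_in(url(url), handler = function(df) {
  src <- df[["_source"]]
  keep <- !is.na(src[["duration"]])              # keep only records that have a duration
  pages[[length(pages) + 1]] <<- df[keep, , drop = FALSE]
}, pagesize = 1000)
filtered <- rbind_pages(pages)                   # one data frame of just the matching records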
I'm trying to create a data frame that is about 1,000,000 x 5 by using a for-loop, but it's been 5+ hours and I don't think it will finish very soon. I'm using the rjson library to read in the data from a large json file. Can someone help me with filling up this data frame in a faster way?
library(rjson)
# read in data from json file
file <- "/filename"
c <- file(file, "r")
l <- readLines(c, -1L)
data <- lapply(X=l, fromJSON)
# specify variables that i want from this data set
myvars <- c("url", "time", "userid", "hostid", "title")
newdata <- matrix(data[[1]][myvars], 1, 5, byrow=TRUE)
# here's where it goes wrong
for (i in 2:length(l)) {
newdata <- rbind(newdata, data[[i]][myvars])
}
newestdata <- data.frame(newdata)
This is taking forever because each iteration of your loop is creating a new, bigger object. Try this:
slice <- function(field, data) unlist(lapply(data, `[[`, field))
data.frame(Map(slice, myvars, list(data)))
This will create a data.frame and preserve your original data types (character, numeric, etc.), if that matters, while forcing everything into a matrix would coerce everything to character.
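If some records are missing one of those fields, [[ returns NULL and unlist() silently drops it, which would misalign the columns; a slightly more defensive variant (just a sketch) substitutes NA instead:
slice <- function(field, data) {
  vapply(data, function(rec) {
    val <- rec[[field]]
    if (is.null(val)) NA_character_ else as.character(val)  # NA keeps all columns the same length
  }, character(1))
}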
Without the data, it's hard to be sure, but there are a couple of things you are doing that are relatively slow. This should be faster, but again, without the data, I can't test:
newdata <- vapply(data, function(x) unlist(x[myvars]), character(5L))  # unlist because "[" on a list returns a list, not a character vector
I'm also assuming that your data is character, which I think it has to be based on title.
Also, as others have noted, the reason yours is slow is that you are growing an object, which requires R to keep re-allocating memory. vapply will allocate the memory ahead of time because it knows the size of each iteration's result and how many items there are.
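vapply fills one column per record here (with row names taken from the names in myvars), so to end up with the same layout as the rbind approach, one row per record, you would transpose the result before wrapping it in a data frame; a sketch:
newestdata <- data.frame(t(newdata), stringsAsFactors = FALSE)  # rows = records, columns = myvars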