Extracting all (possible) optional date values from web page [R] - html

In this URL string, the "toDate=1399849199999" part refers to UNIX time expressed in milliseconds, which is used to retrieve the Premier League table for a particular day.
In this case, the UNIX time corresponds to 11 May 2014:
as.POSIXlt(1399849199999/1000, tz = "GMT", origin = "1970-01-01")
I would like to retrieve all possible UNIX time values for a particular month. For the URL provided here, those six values are stored in the web page's source code, which looks like this:
<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>
Previously I extracted such information with regular expressions, but it was a pain in the neck and I want to do this in an easier way.
I would appreciate it if someone could provide code (ideally with the steps explained) that extracts those values using a web-scraping package in R, preferably XML. I tried it myself but was unsuccessful...

We can use the XML package to parse the HTML from the link you provided, then extract the specific information required (out of the whole HTML) using XPath:
library(XML)
EPL.URL <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399849199999&tableView=CURRENT_STANDINGS"
EPL.doc <- htmlParse(EPL.URL)
xpathSApply(EPL.doc, "//optgroup[@label='results']/option", xmlGetAttr, "value")

rvest makes this pretty easy. Look for the "option" nodes, then grab the "value" attributes.
library("rvest")
h <- read_html('<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>')
h %>% html_nodes("option") %>% html_attr("value")
[1] "1399157999999" "1399244399999" "1399330799999"
[4] "1399417199999" "1399503599999" "1399849199999"

Related

R (rvest) Web Scraping Multiple Pages

I am looking to scrape the results from the Philly DA Democratic Primary race. I want to scrape the ward-division results from the website. I need the ward-division number (e.g. 01-01), the name of the candidate (e.g. LARRY KRASNER), and the percent each candidate received. For this website, there are 86 pages of results at the ward-division level:
https://results.philadelphiavotes.com/ResultsSW.aspx?type=CTY&map=CTY#page-1
Using the SelectorGadget tool, the CSS for each are as follows:
ward-division numbers = ".precinct-results-orangebox-title h1"
name of candidates= ".precinct-results-databox1 h1"
percent results= "#Datawrapper 16DEM .bar-percent"
When I tried to initially scrape the website data, I used the following code:
#Read in the Data
daresults <- read_html("https://results.philadelphiavotes.com/ResultsSW.aspx?type=CTY&map=CTY#page-1")
#Ward-Division Numbers
warddiv <- daresults %>%
  html_nodes(".precinct-results-orangebox-title h1") %>%
  html_text()
And I received a response of
character(0)
Any help on cleaning up the code and creating a loop to scrape all 86 pages would be appreciated. Thanks.
It looks like the data is loaded from JSON files. From the Network tab of your browser's developer tools, the files are located here:
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=16&osn=16&county=04&party=DEM&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=17&osn=17&county=04&party=REP&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=18&osn=18&county=04&party=DEM&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=19&osn=19&county=04&party=REP&LanguageID=1
Use jsonlite or another package to read each file and parse it into a data frame.
For example:
url<-"https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=16&osn=16&county=04&party=DEM&LanguageID=1"
jsonlite::fromJSON(url)
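To collect several races at once, a rough sketch along the same lines (the URLs are the ones listed above; the structure of the parsed result depends on the JSON, so inspect it before flattening):
library(jsonlite)
# two of the GetMapData URLs listed above (raceID/osn/party differ between files)
urls <- c(
  "https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=16&osn=16&county=04&party=DEM&LanguageID=1",
  "https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=17&osn=17&county=04&party=REP&LanguageID=1"
)
results <- lapply(urls, fromJSON)   # download and parse each JSON file
str(results[[1]], max.level = 2)    # inspect to find the ward-division rows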

rvest - find html-node with last page number

I'm learning web scraping and created a little exercise for myself to scrape all titles of a recipe site: https://pinchofyum.com/recipes?fwp_paged=1. (I got inspired by this post: https://www.kdnuggets.com/2017/06/web-scraping-r-online-food-blogs.html).
I want to scrape the value of the last page number, which is (at the time of writing) 64. You can find the number of pages at the bottom. I see that this is stored as "a.facetwp-page last", but for some reason I cannot access this node. I can see that the page number values are stored as 'data-page', but I'm unable to get this value through 'html_attrs'.
I believe the parent node is "div.facetwp-pager" and I can access that one as follows:
library(rvest)
pg <- read_html("https://pinchofyum.com/recipes")
html_nodes(pg, "div.facetwp-pager")
But this is as far as I get. I guess I'm missing something small, but cannot figure out what it is. I know about RSelenium, but I would like to know if and how I can get that last page value (64) with rvest.
Sometimes scraping with rvest doesn't work, especially when the web page is dynamically generated with JavaScript (I also wasn't able to scrape this info with rvest). In those cases, you can use the RSelenium package. I was able to scrape your desired element like this:
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) # specify the browser you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pinchofyum.com/recipes?fwp_paged=1") # navigate to the webpage
webElem <- remDr$findElement(using = "css selector", ".last") # find the desired element
txt <- webElem$getElementText() # get the element's text
#> txt
#>[[1]]
#>[1] "64"

How do I webscrape .dpbox table using selectorgadget with R (rvest)?

I've been trying to scrape data from a specific website using SelectorGadget in R. For example, I successfully scraped http://www.dotabuff.com/heroes/abaddon/matchups before. Usually, I just click on the tables I want using the SelectorGadget Chrome extension and put the CSS selection result into the code as follows.
urlx <- "http://www.dotabuff.com/heroes/abaddon/matchups"
rawData <- html_text(html_nodes(read_html(urlx),"td:nth-child(4) , td:nth-child(3), .cell-xlarge"))
In this case, the html_nodes function does return a whole bunch of nodes (340)
{xml_nodeset (340)}
However, when I try to scrape http://www.dotapicker.com/heroes/Abaddon using SelectorGadget, the resulting code is:
urlx <- "http://www.dotapicker.com/heroes/abaddon"
rawData <- html_text(html_nodes(read_html(urlx),".ng-scope:nth-child(1) .ng-scope .ng-binding"))
Unfortunately, no nodes actually show up after the html_nodes function is called, and I get the result
{xml_nodeset (0)}
I feel like this has something to do with the nesting of the table in a drop down box (compared to previously, the table was right on the webpage itself) but I'm not sure how to get around it.
Thank you and I appreciate any help!
It seems like this page loads some data dynamically using XHR. In Chrome you can check this by opening the developer tools (Inspect) and going to the Network tab. If you do, you will see that a number of JSON files are being loaded. You can scrape those JSON files directly and then parse them to extract the info you need. Here is a quick example:
library(httr)
library(jsonlite)
heroinfo_json <- GET("http://www.dotapicker.com/assets/json/data/heroinfo.json")
heroinfo_flat <- fromJSON(content(heroinfo_json, type = "text"))
#> No encoding supplied: defaulting to UTF-8.
winrates_json <- GET("http://www.dotapicker.com/assets/dynamic/winrates10d.json")
winrates_flat <- fromJSON(content(winrates_json, type = "text"))
#> No encoding supplied: defaulting to UTF-8.
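From there you would inspect the parsed lists to find the components you need; a minimal sketch continuing from the objects above (the component names inside the JSON are not shown here, so treat anything you pick as something to verify):
str(heroinfo_flat, max.level = 1)   # top-level structure of the hero info
str(winrates_flat, max.level = 1)   # top-level structure of the win rates
# once you spot the relevant component, coerce it with as.data.frame()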

R: parse JSON/XML exported compound properties from Pubchem

I would like to parse all chemical properties of a given compound as given in Pubchem in R, using the JSON (or XML) export facility.
Example: ALPHA-IONONE, pubchem compound ID 5282108
https://pubchem.ncbi.nlm.nih.gov/compound/5282108
library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
or
library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
will get me a tree of nested lists, but how do I go from this rather complicated list of nested lists to a nice dataframe or list of dataframes?
In this case, what I am after is everything under
3.1 Computed Descriptors
3.2 Other Identifiers
3.3 Synonyms
4.1 Computed Properties
in a single row of a dataframe and each element in a separate named column with multiple items per element (e.g. multiple synonyms) pasted together with a "|" as a delimiter. E.g. in this case something like
pubchemid IUPAC_Name InChI InChI_Key Canonical SMILES Isomeric SMILES CAS EC Number Wikipedia MeSH Synonyms Depositor-Supplied Synonyms Molecular_Weight Molecular_Formula XLogP3 Hydrogen_Bond_Donor_Count ...
5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....
Fields with multiple items, such as Depositor-Supplied Synonyms could be pasted together with a "|", e.g. value could be ALPHA-IONONE|Iraldeine|...
Second, I would also like to import section
4.2.2 Kovats Retention Index
as a dataframe
pubchemid column_class kovats_ri
5282108 Standard non-polar 1413
5282108 Standard non-polar 1417
...
5282108 Semi-standard non-polar 1427
...
(section 4.3.1 GC-MS would have been nice too, but since it only displays the 3 top peaks this is a little useless right now, so I'll skip that)
Anybody any idea how to achieve this in an elegant way?
PS Note that not all these fields will necessarily exist for any given query.
2D structure and some properties can also be obtained from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display
and 3D structure from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display
Data can also be exported as XML, using
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display
if that would be any easier
Note: I also tried the R package rpubchem, but that one only seems to import a small amount of the available info:
library("rpubchem")
get.cid(5282108)
CID IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one C13H20O 192.297300 0 3 0 1 14 17 5282108
My proposal works on XML files because (thanks to XPath) I find them more convenient for traversing and selecting nodes.
Please note that this is neither fast (it took a few seconds while testing) nor optimal (I parse each file twice: once for names and the like, and once for the Kovats Retention Index). But I guess you will want to parse a set of files once and then go ahead with your real business, and premature optimization is the root of all evil.
I have put the main tasks into separate functions. If you want to get data for one specific PubChem record, they are ready to use. But if you want to get data from a few PubChem records at once, you can define a vector of pointers to the data and use the examples at the bottom to merge the results together. In my case, the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them (remember that each site will be requested twice, and with a greater number of records you probably want to handle a faulty network somehow).
The compound you have linked to has multiple entries for "EC Number". They differ by ReferenceNumber, but not by Name. I wasn't sure why that is or what I should do with it (your sample output contains only one entry for EC Number), so I left this to R. R added suffixes to the duplicated values and created EC.Number.1, EC.Number.2, etc. These suffixes do not match the ReferenceNumber in the file, and the same column in the master data frame will probably refer to different ReferenceNumbers for different compounds.
It seems that PubChem uses the following format for tags: <type>Value[List]. In a few places I have hardcoded StringValue, but maybe some compound has different types in the same fields. I generally haven't handled lists, except where it was requested. So further modifications might be needed as more data is thrown at this code.
If you have any questions, please post them in comments. I am not sure which parts of the code need explaining.
library("xml2")
library("data.table")
compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)
  properties <- sapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    value <- ifelse(length(xml_find_all(x, "./d1:StringValueList", ns)) > 0,
                    paste(sapply(
                      xml_find_all(x, "./d1:StringValueList", ns),
                      xml_text, trim=TRUE), sep="", collapse="|"),
                    xml_text(
                      xml_find_one(x, "./*[contains(name(),'Value')]", ns),
                      trim=TRUE)
    )
    names(value) <- name
    return(value)
  })
  rm(compound, information)
  properties <- as.list(properties)
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  return(data.frame(properties))
}

compound.retention.index <- function(file=NULL) {
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)
  indexes <- lapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    values <- as.numeric(sapply(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns),
      xml_text))
    data.frame(pubchemid=pubchemid,
               column_class=name,
               kovats_ri=values)
  })
  return( do.call("rbind", indexes) )
}

compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")

cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill=TRUE
)
rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))
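If the XML files are not on disk yet, one way to fetch them (a sketch, using the PUG View XML URL format quoted in the question) is:
cids <- c(5282108, 5282148, 91754124)
urls <- sprintf(
  "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/%d/XML/?response_type=display",
  cids)
files <- sprintf("./%d.xml", cids)  # matches the paths in 'compounds' above
mapply(download.file, urls, files)  # download each record once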

R - Extracting Tables From Websites Using XML Package

I am trying to replicate the method used in a previous answer here, Scraping html tables into R data frames using the XML package, for my own work, but I cannot get the data to extract. The website I am using is:
http://www.footballfanalytics.com/articles/football/euro_super_league_table.html
I just wish to extract a table of each team name and their current rating score. My code is as follows:
library(XML)
theurl <- "http://www.footballfanalytics.com/articles/football/euro_super_league_table.html"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
This produces the error message
Error in tables[[which.max(n.rows)]] :
attempt to select less than one element
Could anyone suggest a solution please? Is there something in this particular site causing this not to work? Or is there a better alternative method I can try? Thanks
It seems as if the data is loaded via JavaScript. Try:
library(XML)
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
      points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
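If you prefer a data frame with numeric points rather than a character matrix, a small follow-up sketch using the same XPath expressions:
esl <- data.frame(
  team   = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
  points = as.numeric(xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue)),
  stringsAsFactors = FALSE)
head(esl)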