R: getting data from website, method POST, dropdown menu options change - html

I'm trying to use R to extract data from a website where I have to select information from 5 dropdown menus and then click on an export or consult button (http://200.20.53.7/dadosaguaweb/default.aspx). I found this excellent thread: Getting data in R as dataframe from web source, but it didn't answer my question because of some differences:
1) The website's form uses the POST method, not GET.
I tried using the RHTMLForms package together with RCurl, in a way that should work for either POST or GET. Namely:
library(RCurl)
library(RHTMLForms)

baseURL <- "http://200.20.53.7/dadosaguaweb/default.aspx"
forms <- getHTMLFormDescription(baseURL)
form1 <- forms$form1
dadosAgua <- createFunction(form1)
dadosDef <- dadosAgua(75, "PS0421", 1979, 2015, 6309)
2) The website is one of those where the list of options for the second dropdown menu changes according to what you selected for the first one and so on. Therefore, when I set the first input parameter to "75", it does not accept the second one as "PS0421" because that option is not available when the first parameter is at its default value.
So, I tried a step-by-step approach, changing one parameter at a time, like this:
baseURL <- "http://200.20.53.7/dadosaguaweb/default.aspx"

# Step 1: submit the form with only the first parameter set
forms1 <- getHTMLFormDescription(baseURL)
form1 <- forms1$form1
dadosAgua1 <- createFunction(form1)
dadosDef1 <- dadosAgua1(75)

# Step 2: re-read the form from the response and set the second parameter
forms2 <- getHTMLFormDescription(dadosDef1)
form2 <- forms2$form1
dadosAgua2 <- createFunction(form2)
dadosDef2 <- dadosAgua2(75, "PS0421")
And I get the error message:
Error in function (type, msg, asError = TRUE) : Empty reply from server
Now I'm completely stuck.

I think what you're trying to do is navigation scripting, i.e. getting code to interact with a web page. That can be complicated to do programmatically, because for the form's fields to change in response to what you select, the page actually has to be running in a web browser.
An alternative might be to use a tool built for that, like CasperJS, which drives a headless browser, so the page's fields can change based on the behaviour you script. I don't know how comfortable you are with JavaScript, and I don't know of any R packages that can do what CasperJS does, so I can't recommend anything else.
Edit:
Take a look at RSelenium
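For example, RSelenium can drive a real (or headless) browser, select an option in the first dropdown, wait for the ASP.NET postback to refresh the dependent dropdown, and only then select the next value. The element IDs below ("ddlRegiao", "ddlEstacao") are placeholders rather than the actual IDs on the dadosaguaweb page, so treat this as a sketch of the approach, not working code for that site:
library(RSelenium)

rD <- rsDriver(browser = "firefox")
remDr <- rD$client
remDr$navigate("http://200.20.53.7/dadosaguaweb/default.aspx")

# Hypothetical IDs: replace "#ddlRegiao" and "#ddlEstacao" with the real <select> ids from the page source
first <- remDr$findElement(using = "css selector", "#ddlRegiao option[value='75']")
first$clickElement()
Sys.sleep(2)  # give the postback time to refresh the second dropdown

second <- remDr$findElement(using = "css selector", "#ddlEstacao option[value='PS0421']")
second$clickElement()
# ... repeat for the remaining dropdowns, then locate and click the export button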

Related

rvest - find html-node with last page number

I'm learning web scraping and created a little exercise for myself: scraping all the titles of a recipe site, https://pinchofyum.com/recipes?fwp_paged=1. (I was inspired by this post: https://www.kdnuggets.com/2017/06/web-scraping-r-online-food-blogs.html.)
I want to scrape the value of the last page number, which is (at the time of writing) 64. You can find the number of pages at the bottom. I see that this is stored as "a.facetwp-page last", but for some reason I cannot access this node. I can see that the page number values are stored as 'data-page', but I'm unable to get this value through 'html_attrs'.
I believe the parent node is "div.facetwp-pager" and I can access that one as follows:
library(rvest)
pg <- read_html("https://pinchofyum.com/recipes")
html_nodes(pg, "div.facetwp-pager")
But this is as far as I get. I guess I'm missing something small, but I cannot figure out what it is. I know about RSelenium, but I would like to know if and how to get that last page value (64) with rvest.
Sometimes scraping with rvest doesn't work, especially when the web page is dynamically generated with JavaScript (I also wasn't able to scrape this info with rvest). In those cases, you can use the RSelenium package. I was able to scrape your desired element like this:
library(RSelenium)
rD <- rsDriver(browser = "firefox")  # specify which browser Selenium should open
remDr <- rD$client
remDr$navigate("https://pinchofyum.com/recipes?fwp_paged=1")  # navigate to the page
webElem <- remDr$findElement(using = "css selector", ".last")  # find the desired element
txt <- webElem$getElementText()  # gets us the element's text (not the HTML)
#> txt
#>[[1]]
#>[1] "64"
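If you need the page count as a number, and want to shut the browser down when you're done, something along these lines should work (getElementText() returns a list of character strings):
last_page <- as.integer(txt[[1]])  # "64" -> 64
last_page

remDr$close()     # close the browser window
rD$server$stop()  # stop the Selenium server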

R: Check existence of URL, problems with httr::GET() and url.exists()

I have a list of about 13,000 URLs that I want to extract info from; however, not every URL actually exists. In fact, the majority don't. I have just tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether the URLs actually exist before parsing them with html(). I have tried httr's GET() function as well as RCurl's url.exists() function. For some reason url.exists() always returns FALSE even when the URL does exist, and the way I am using GET() always returns a success; I think this is because the page is being redirected.
The following URLs represent the type of pages I am parsing; the second does not exist:
urls <- data.frame('site' = 1:3,
                   'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))
urls$urls <- as.character(urls$urls)
For GET(), the problem is that the second URL doesn't actually exist but it is redirected and therefore returns a "success".
urls$urlExists <- sapply(1:length(urls[, 1]),
                         function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))
For url.exists(), I get FALSE returned for all three, even though the first and third URLs do exist.
urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))
I checked these two posts 1, 2, but I would prefer not to use a user agent, simply because I am not sure how to find mine or whether it would change for different people using this code on other computers, which would make the code harder for others to pick up and use. Both posts' answers suggest using GET() in httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.
Can anyone suggest a good way in R to test the existence of a URL before parsing them to html()? I would also be happy for any other suggested work around for this issue.
UPDATE:
After looking into the returned value from GET() I figured out a work around, see answers for details.
With httr, use url_success() with redirect following turned off:
library(httr)
urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)
sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
url_success(x) is now deprecated; please use !http_error(x) instead.
So, updating the solution from hadley:
library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

!sapply(urls, http_error)
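Note that http_error() treats only status codes of 400 and above as errors, and when given a URL it follows redirects by default, so the redirected (non-existent) URL can still come back as TRUE here. A sketch that keeps the followlocation = 0L config from the original answer and requires a 2xx status instead (my variant, not from the original answers):
library(httr)

# require a 2xx response with redirect following disabled, so a 3xx redirect
# to a generic landing page does not count as the URL existing
sapply(urls, function(u) status_code(GET(u, config(followlocation = 0L))) < 300,
       USE.NAMES = FALSE)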
After a suggestion from @TimBiegeleisen I looked at what was returned from the function GET(). It seems that if the URL exists, GET() will return this URL as a value, but if it is redirected, a different URL is returned. I just changed the code to check whether the URL returned by GET() matched the one I submitted.
# GET(...)[[1]] is the url element of the response, i.e. the final URL after any redirect
urls$urlExists <- sapply(1:length(urls[, 1]),
                         function(x) ifelse(GET(urls[x, 'urls'])[[1]] == urls[x, 'urls'], 1, 0))
I would be interested in learning about any better methods that people use for the same thing.

httr: retrieving data with POST()

Disclaimer: while I have managed to grab data from another source using httr's POST function, let it be known that I am a complete n00b with regards to httr and HTML forms in general.
I would like to bring some data directly into R from a website using httr. My first attempt involved passing a named list to the body arg (as is shown in this vignette). However, I noticed square brackets in the form input names (at least I think they're the form input arguments). So instead, I tried passing in the body as a string as I think it should appear in the request body:
library(httr)

url <- 'http://research.stlouisfed.org/fred2/series/TOTALSA/downloaddata'
query <- paste('form[native_frequency]=Monthly', 'form[units]=lin',
               'form[frequency]=Monthly', 'form[obs_start_date]="1976-01-01"',
               'form[obs_end_date]="2014-11-01"', 'form[file_format]=txt',
               sep = '&')
response <- POST(url, body = query)
In any case, the above code just returns the webpage source code and I cannot figure out how to properly submit the form so that it returns the same data as manually clicking the form's 'Download Data' button.
In Developer Tools/Network on Chrome, it states in the Response Header under Content-Disposition that there is a text file attachment containing the data when I manually click the 'Download Data' button on the form. It doesn't appear to be in any of the headers associated with the response object in the code above. Why isn't this file getting returned by the POST request--where's the file with the data going?
Feels like I'm missing something obvious. Anyone care to help me connect the dots?
Generally, if you're going to use httr, you let it build and encode the data for you; you just pass in the information via a list of form values. Try:
url <- "http://research.stlouisfed.org/fred2/series/TOTALSA/downloaddata"
query <- list('form[native_frequency]' = "Monthly",
              'form[units]' = "lin",
              'form[frequency]' = "Monthly",
              'form[obs_start_date]' = "1996-01-01",
              'form[obs_end_date]' = "2014-11-01",
              'form[file_format]' = "txt")
response <- POST(url, body = query)
content(response, "text")
and the return looks something like
[1] "Title: Total Vehicle Sales\r\nSeries ID: TOTALSA\r\nSource:
US. Bureau of Economic Analysis\r\nRelease: Supplemental Estimates, Motor
Vehicles\r\nSeasonal Adjustment: Seasonally Adjusted Annual Rate\r\nFrequency: Monthly\r\nUnits:
Millions of Units\r\nDate Range: 1996-01-01 to 2014-11-
01\r\nLast Updated: 2014-12-05 7:16 AM CST\r\nNotes: \r\n\r\nDATE
VALUE\r\n1996-01-01 14.8\r\n1996-02-01 15.6\r\n1996-03-01 16.0\r\n1996-04-01 15.5\r\n1996-05-01
16.0\r\n1996-06-01 15.3\r\n1996-07-01 15.1\r\n1996-08-01 15.5\r\n1996-09-01 15.5\r\n1996-10-01 15.3\r
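If you want that text as a data frame rather than one long string, one option (a sketch that assumes the header line in the export is literally "DATE VALUE") is to split on the line breaks and read everything after that header:
txt <- content(response, "text")
obs <- strsplit(txt, "\r\n")[[1]]

# the metadata block ends at the "DATE VALUE" header (assumed format)
hdr <- grep("^DATE", obs)[1]

dat <- read.table(text = obs[(hdr + 1):length(obs)],
                  col.names = c("DATE", "VALUE"),
                  stringsAsFactors = FALSE)
dat$DATE <- as.Date(dat$DATE)
head(dat)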

Can not query nodes but can see all the nodes and properties in the data browser

I have imported a few nodes with properties via the Cypher CSV import (command below), and the nodes seem to have loaded correctly, as I can view them in the REST API (the data browser). When I execute a MATCH (n) RETURN n query, all of the nodes are displayed in the Results pane, and when I click on one of the nodes its properties are displayed in the left pane of the browser (I would attach a screenshot showing what I am referring to here, which would make this issue A LOT clearer and easier to understand, but apparently us neophytes are prohibited from providing such useful information).
However, when I try to query any of the nodes directly, I get no rows returned. By "query the nodes directly" I am referring to querying with a WHERE condition where I ask for a specific property:
MATCH (n)
WHERE n:Type="Idea"
RETURN n
Type is one of the properties of the node. No rows are returned from the query. I can click on the node in the Stream pane to open the properties dialog, and I can see the Type property is clearly "Idea."
Am I missing something? The nodes and properties seem to have loaded into the DB correctly, but I can't seem to query anything. Is "ID" a restricted term? Do I even need an "ID" property? (I thought I read somewhere that you shouldn't trust the auto-generated IDs, as they aren't guaranteed to be unique over time.)
Import statement used to load the nodes is below:
$ auto-index name, ID
$ import-cypher -i ProjectNodesCSV.csv -o ProjectOut.csv CREATE (n:Project {ID:{ID},Name: {Name}, Type: {Type}, ProjectGroupName: {ProjectGroupName}, ProjectCategoryName: {ProjectCategoryName}, UnifierID: {UnifierID}, StartDate: {StartDate}, EndDate: {EndDate}, CapitalCosts: {CapitalCosts}, OandMCosts: {OandMCosts}}) RETURN ID(n) as ID, n.Name as Name

Extracting population data from website; wiki town webpages

G'day Everyone,
I am looking for a raster layer for human population/habitation in Australia. I have tried finding some free datasets online but couldn't really find anything in a useful format. I thought it might be interesting to try and scrape population data from Wikipedia and make my own raster layer. To this end I have tried getting the info from wiki, but not knowing anything about HTML has not helped me.
The idea is to supply a list of all the towns in Australia that have wiki pages and extract the appropriate data into a data.frame.
I can get the webpage source data into R, but am stuck on how to extract the particular data that I want. The code below shows where I am stuck, any help would be really appreciated or some hints in the right direction.
I thought I might be able to use readHTMLTable() because, in the normal webpage, the info I want is off to the right in a nice table. But when I use this function I get an error (below). Is there any way I can specify this table when I am getting the source info?
Sorry if this question doesn't make much sense, I don't have any idea what I am doing when it comes to searching HTML files.
Thanks for your help, it is greatly appreciated!
Cheers,
Adam
require(XML)  # htmlParse() and readHTMLTable() come from the XML package
require(RJSONIO)

loc.names <- data.frame(town = c('Sale', 'Bendigo'), state = c('Victoria', 'Victoria'))
u <- paste('http://en.wikipedia.org/wiki/',
           sep = '', loc.names[, 1], ',_', loc.names[, 2])
res <- lapply(u, function(x) htmlParse(x))
Error when I use readHTMLTable:
tabs <- readHTMLTable(res[1])
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"list"’
For instance, some of the data I need looks like this in the HTML. My question is: how do I specify these locations in the HTML I have?
<span class="geo">-38.100; 147.067
title="Victoria (Australia)">Victoria</a>. It has a population (2011) of 13,186
res is a list, so you need to use res[[1]] rather than res[1] to access its elements.
Using readHTMLTable on these elements will give you all the tables. The geo info is contained in a table with class = "infobox vcard"; you can extract those tables separately and then pass them to readHTMLTable:
require(XML)
lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
       readHTMLTable)
If you are not familiar with XPath, the selectr package allows you to use CSS selectors, which may be easier.
require(selectr)
querySelectorAll(res[[1]], "table span .geo")
[[1]]
<span class="geo">-38.100; 147.067</span>
[[2]]
<span class="geo">-38.100; 147.067</span>
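From there, pulling the numbers out is mostly string handling. A sketch under the assumption that the geo span and the infobox look like the fragments above (the two-column key/value layout of the infobox is assumed, not verified):
require(XML)

# coordinates: the "geo" span holds "latitude; longitude"
geo <- xpathSApply(res[[1]], '//span[@class="geo"]', xmlValue)[1]
coords <- as.numeric(strsplit(geo, ";\\s*")[[1]])  # e.g. c(-38.100, 147.067)

# population: parse the infobox and look for a row labelled "Population"
# (assumes the infobox parses as a two-column key/value table)
infobox <- readHTMLTable(getNodeSet(res[[1]], '//table[contains(@class, "infobox")]')[[1]],
                         stringsAsFactors = FALSE)
infobox[grep("Population", infobox[[1]]), ]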