I have a list of about 13,000 URLs that I want to extract info from; however, not every URL actually exists. In fact the majority don't. I have just tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether the URLs actually exist before passing them to html(). I have tried httr's GET() function as well as RCurl's url.exists(). For some reason url.exists() always returns FALSE even when the URL does exist, and the way I am using GET() always returns a success; I think this is because the page is being redirected.
The following URLs represent the type of pages I am parsing; the second does not exist:
urls <- data.frame('site' = 1:3,
                   'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))
urls$urls <- as.character(urls$urls)
For GET(), the problem is that the second URL doesn't actually exist but it is redirected and therefore returns a "success".
urls$urlExists <- sapply(1:length(urls[, 1]),
                         function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))
For url.exists(), I get FALSE returned for all three, even though the first and third URLs do exist.
urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))
I checked these two posts 1, 2, but I would prefer not to use a user agent, simply because I am not sure how to find mine or whether it would change for different people using this code on other computers, which would make the code harder for others to pick up and use. Both posts' answers suggest using GET() in httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.
Can anyone suggest a good way in R to test whether a URL exists before passing it to html()? I would also be happy with any other suggested workaround for this issue.
UPDATE:
After looking into the returned value from GET() I figured out a work around, see answers for details.
With httr, use url_success() with redirect following turned off:
library(httr)
urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)
sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
url_success(x) is deprecated; please use !http_error(x) instead.
So here is an updated version of Hadley's solution:
library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

!sapply(urls, http_error)
After a suggestion from @TimBiegeleisen I looked at what is returned from the function GET(). It seems that if the URL exists, GET() returns that URL as a value (the first element of the response), but if the request is redirected, a different URL is returned. I just changed the code to check whether the URL returned by GET() matches the one I submitted.
urls$urlExists <- sapply(1:length(urls[, 1]),
                         function(x) ifelse(GET(urls[x, 'urls'])[[1]] == urls[x, 'urls'], 1, 0))
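If it helps anyone else, the last step is then just to parse the pages that passed the check. A minimal sketch of that, assuming the parsing is done with rvest (where read_html() is the newer name for html()):

library(rvest)   # read_html() is the newer name for rvest's html()

# keep only the URLs flagged as existing by the check above
good_urls <- urls[urls$urlExists == 1, 'urls']

# parse just those pages; the redirected/non-existent ones are never touched
pages <- lapply(good_urls, read_html)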
I would be interested in learning about any better methods that people use for the same thing.
Related
I'm trying to use R to extract data from a website where I have to select information from 5 dropdown menus and then click on an export or consult button (http://200.20.53.7/dadosaguaweb/default.aspx). I found this excellent thread: Getting data in R as dataframe from web source, but it didn't answer my question because of some differences:
1) The website's form uses the POST method, not GET;
I tried using the RHTMLForms package together with RCurl, in a way that should work for either POST or GET. Namely:
baseURL <- "http://200.20.53.7/dadosaguaweb/default.aspx"
forms <- getHTMLFormDescription(baseURL)
form1 <- forms$form1
dadosAgua <- createFunction(form1)
dadosDef <- dadosAgua(75, "PS0421", 1979, 2015, 6309)
2) The website is one of those where the list of options for the second dropdown menu changes according to what you selected for the first one and so on. Therefore, when I set the first input parameter to "75", it does not accept the second one as "PS0421" because that option is not available when the first parameter is at its default value.
So, I tried a step-by-step approach, changing one parameter at a time, like this:
baseURL <- "http://200.20.53.7/dadosaguaweb/default.aspx"
forms1 <- getHTMLFormDescription(baseURL)
form1 <- forms1$form1
dadosAgua1 <- createFunction(form1)
dadosDef1 <- dadosAgua1(75)
forms2 <- getHTMLFormDescription(dadosDef1)
form2 <- forms2$form1
dadosAgua2 <- createFunction(form2)
dadosDef2 <- dadosAgua2(75, "PS0421")
And I get the error message:
Error in function (type, msg, asError = TRUE) : Empty reply from server
Now I'm completely stuck.
I think what you're trying to do is navigation scripting, i.e. getting code to interact with a webpage. It can be complicated to do that programmatically, because for the form fields to change in response to what you click, the page actually has to be loaded in a web browser.
An alternative might be for you to use a tool that can do that for you, like CasperJS, which drives a headless browser, so the page fields can change based on the behaviour you script. I don't know how comfortable you are with JavaScript, and I don't know of any R packages that can do what CasperJS does, so I can't recommend anything else.
Edit:
Take a look at RSelenium
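If you go the RSelenium route, a rough sketch of the idea might look like the following. It is untested against that site: the element IDs (corpoDagua, estacao) are placeholders for whatever the real dropdowns are called, and rsDriver() needs a working Selenium/browser setup:

library(RSelenium)

# start a browser session (needs a working Selenium setup)
driver <- rsDriver(browser = "firefox")
remote <- driver$client

remote$navigate("http://200.20.53.7/dadosaguaweb/default.aspx")

# pick a value in the first dropdown; the page then repopulates the dependent ones
first <- remote$findElement(using = "css selector", "#corpoDagua option[value='75']")
first$clickElement()

Sys.sleep(2)   # crude wait for the second dropdown to refresh

second <- remote$findElement(using = "css selector", "#estacao option[value='PS0421']")
second$clickElement()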
Disclaimer: while I have managed to grab data from another source using httr's POST function, let it be known that I am a complete n00b with regards to httr and HTML forms in general.
I would like to bring some data directly into R from a website using httr. My first attempt involved passing a named list to the body arg (as is shown in this vignette). However, I noticed square brackets in the form input names (at least I think they're the form input arguments). So instead, I tried passing in the body as a string as I think it should appear in the request body:
url <- 'http://research.stlouisfed.org/fred2/series/TOTALSA/downloaddata'
query <- paste('form[native_frequency]=Monthly', 'form[units]=lin',
               'form[frequency]=Monthly', 'form[obs_start_date]="1976-01-01"',
               'form[obs_end_date]="2014-11-01"', 'form[file_format]=txt',
               sep = '&')
response <- POST(url, body = query)
In any case, the above code just returns the webpage source code and I cannot figure out how to properly submit the form so that it returns the same data as manually clicking the form's 'Download Data' button.
In Developer Tools/Network on Chrome, it states in the Response Header under Content-Disposition that there is a text file attachment containing the data when I manually click the 'Download Data' button on the form. It doesn't appear to be in any of the headers associated with the response object in the code above. Why isn't this file getting returned by the POST request--where's the file with the data going?
Feels like I'm missing something obvious. Anyone care to help me connect the dots?
Generally, if you're going to use httr, you let it build and encode the data for you; you just pass the information in as a list of form values. Try:
url<-"http://research.stlouisfed.org/fred2/series/TOTALSA/downloaddata"
query <- list('form[native_frequency]'="Monthly",
'form[units]'="lin",
'form[frequency]'="Monthly",
'form[obs_start_date]'="1996-01-01",
'form[obs_end_date]'="2014-11-01",
'form[file_format]'="txt")
response <- POST(url, body = query)
content(response, "text")
and the return looks something like
[1] "Title: Total Vehicle Sales\r\nSeries ID: TOTALSA\r\nSource:
US. Bureau of Economic Analysis\r\nRelease: Supplemental Estimates, Motor
Vehicles\r\nSeasonal Adjustment: Seasonally Adjusted Annual Rate\r\nFrequency: Monthly\r\nUnits:
Millions of Units\r\nDate Range: 1996-01-01 to 2014-11-
01\r\nLast Updated: 2014-12-05 7:16 AM CST\r\nNotes: \r\n\r\nDATE
VALUE\r\n1996-01-01 14.8\r\n1996-02-01 15.6\r\n1996-03-01 16.0\r\n1996-04-01 15.5\r\n1996-05-01
16.0\r\n1996-06-01 15.3\r\n1996-07-01 15.1\r\n1996-08-01 15.5\r\n1996-09-01 15.5\r\n1996-10-01 15.3\r
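If you then want the observations as a data frame rather than one long string, something like this should work. It is only a sketch: it assumes the text keeps the layout shown above, with each observation on its own line as a date/value pair and no missing values:

txt <- content(response, "text")

# keep only the lines that start with a date, i.e. the observations
obs <- grep("^\\d{4}-\\d{2}-\\d{2}", strsplit(txt, "\r\n")[[1]], value = TRUE)

totalsa <- read.table(text = paste(obs, collapse = "\n"),
                      col.names = c("date", "value"),
                      colClasses = c("Date", "numeric"))
head(totalsa)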
I keep getting the same error in my program. I've written a method that takes some messy HTML and turns it into neater strings. This works fine on its own; however, when I run the whole program I get the following error:
kamer.rb:9:in `normalise_instrumentation': undefined method `split' for #<Nokogiri::XML::NodeSet:0x007f92cb93bfb0> (NoMethodError)
I'd be really grateful for any info or advice on why this happens and how to stop it.
The code is here:
require 'nokogiri'
require 'open-uri'

def normalise_instrumentation(instrumentation)
  messy_array = instrumentation.split('.')
  normal_array = []
  messy_array.each do |section|
    if section =~ /\A\d+\z/
      normal_array << section
    end
  end
  return normal_array
end

doc = Nokogiri::HTML(open('http://www.cs.vu.nl/~rutger/vuko/nl/lijst_van_ooit/complete-solo.html'))
table = doc.css('table[summary=works] tr')

work_value = []
work_hash = {}

table.each do |row|
  piece = [row.css('td[1]'), row.css('td[2]'), row.css('td[3]')].map { |r|
    r.text.strip!
  }
  work_value = work_value.push(piece)
  work_key = normalise_instrumentation(row.css('td[3]'))
  work_hash[work_key] = work_value
end

puts work_hash
The problem is here:
row.css('td[3]')
Here's why:
row.css('td[3]').class
# => Nokogiri::XML::NodeSet < Object
You're creating your piece array which then becomes an array of NodeSets, which is probably not what you want, because text against a NodeSet often returns a weird String of concatenated text from multiple nodes. You're not seeing that happen here because you're searching inside a row (<tr>) but if you were to look one level up, in the <table>, you'd have a cocked gun pointed at your foot.
Passing a NodeSet to your normalise_instrumentation method is a problem because NodeSet doesn't have a split method, which is the error you're seeing.
But, it gets worse before it gets better. css, like search and xpath, returns a NodeSet, which is akin to an Array. Passing an array-like critter to the method will still result in confusion, because you really want just the Node found, not a set of Nodes. So I'd probably use:
row.at('td[3]')
which will return only the node.
At this point you probably want the text of that node, something like
row.at('td[3]').text
would make more sense because then the method would receive a String, which does have a split method.
However, it appears there are additional problems, because some of the cells you want don't exist, so you'll get nil values also.
This isn't one of my better answers, because I'm still trying to grok what you're doing. Providing us with a minimal example of the HTML you need to parse, and the output you want to capture, will help us fine-tune your code to get what you want.
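To make that concrete, here is one way the loop could be rewritten along those lines. This is only a sketch (not tested against the real page); the nil guard is there because of the missing cells mentioned above:

table.each do |row|
  cells = (1..3).map { |i| row.at("td[#{i}]") }
  next if cells.any?(&:nil?)   # skip rows that don't have all three cells

  piece = cells.map { |cell| cell.text.strip }
  work_value << piece

  # pass a String into the method, not a NodeSet
  work_key = normalise_instrumentation(cells[2].text)
  work_hash[work_key] = work_value
end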
I had a similar error (undefined method) for a different reason: in my case it was due to an extra dot (put there by mistake), like this:
status = data.css.("status font-large").text
It was fixed by removing the extra dot after css, as shown below:
status = data.css("status font-large").text
I hope this helps someone else.
I'm trying to grab the href value in <a> HTML tags using Nokogiri.
I want to identify whether they are a path, file, URL, or even a <div> id.
My current work is:
hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end
The possible values in a href might be:
somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous
Is there a mechanism to identify whether the value is a valid full URL, or file, or path or others?
Try URI:
require 'uri'
URI.parse('somefile.html').path
=> "somefile.html"
URI.parse('http://www.someurl.com/somepath/somepath').path
=> "/somepath/somepath"
URI.parse('/some/path/here').path
=> "/some/path/here"
URI.parse('#previous').path
=> ""
Nokogiri is often used with Ruby's URI or open-uri, so if that's the case in your situation you'll have access to its methods. You can use that to attempt to parse the URI (using URI.parse). You can also generally use URI.join(base_uri, retrieved_href) to construct the full URL, provided you've stored the base_uri.
(Edit/side-note: further details on using URI.join are available here: https://stackoverflow.com/a/4864170/624590 ; do note that URI.join takes strings as parameters, not URI objects, so coerce where necessary.)
Basically, to answer your question
Is there a mechanism to identify whether the value is a valid full
url, or file, or path or others?
If the retrieved_href and the base_uri are well formed, and retrieved_href == the joined pair, then it's an absolute path. Otherwise it's relative (again, assuming well formed inputs).
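A tiny illustration of that check, using one relative and one absolute href (the base URL here is just an example):

require 'uri'

base = 'http://www.someurl.com/index.html'

['somefile.html', 'http://www.someurl.com/somepath/somepath'].each do |href|
  joined = URI.join(base, href).to_s
  puts "#{href} -> #{joined} (#{joined == href ? 'absolute' : 'relative'})"
end

# somefile.html -> http://www.someurl.com/somefile.html (relative)
# http://www.someurl.com/somepath/somepath -> http://www.someurl.com/somepath/somepath (absolute)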
If you use URI to parse the href values, then apply some heuristics to the results, you can figure out what you want to know. This is basically what a browser has to do when it's about to send a request for a page or a resource.
Using your sample strings:
%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |u|
  puts URI.parse(u).class
end
Results in:
URI::Generic
URI::HTTP
URI::Generic
URI::Generic
The only one that URI recognizes as a true HTTP URI is "http://www.someurl.com/somepath/somepath". All the others are missing the scheme "http://". (There are many more schemes you could encounter. See the specification for more information.)
Of the generic URIs, you can use some rules to sort through them so you'd know how to react if you have to open them.
If you gathered the HREF strings by scraping a page, you can assume it's safe to use the same scheme and host if the URI in question doesn't supply one. So, if you initially loaded "http://www.someurl.com/index.html", you could use "http://www.someurl.com/" as your basis for further requests.
From there, look inside the strings to determine whether they are anchors, absolute paths or relative paths; a short sketch implementing these rules follows the list. If the string:
Starts with #, it's an anchor and would be applied to the current page without any need to reload it.
Doesn't contain a path delimiter /, it's a filename and would be added to the currently retrieved URL, substituting the file name, and retrieved. A nice way to do the substitution is to use File.dirname, File.basename and File.join against the string.
Begins with a path delimiter, it's an absolute path and is used to replace the path in the original URL. URI::split and URI::join are your friends here.
Doesn't begin with a path delimiter, it's a relative path and is added to the current URI similarly to #2.
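Here is a rough sketch of those rules as code; the method name is made up, and it's only as good as the heuristics above:

require 'uri'

def classify_href(href)
  return :anchor    if href.start_with?('#')
  return :absolute  if href.start_with?('/')
  return :full_url  if URI.parse(href).is_a?(URI::HTTP)
  href.include?('/') ? :relative_path : :filename
rescue URI::InvalidURIError
  :other
end

classify_href('#previous')                                 # => :anchor
classify_href('/some/path/here')                           # => :absolute
classify_href('http://www.someurl.com/somepath/somepath')  # => :full_url
classify_href('somefile.html')                             # => :filename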
Regarding:
hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end
I'd use this instead:
hrefvalue = html.search('a').map { |a| a['href'] }
But that's just me.
A final note: URI has some problems with age and needs an update. It's a useful library but, for heavy-duty URI rippin' apart, I highly recommend looking into using Addressable/URI.
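For a taste, here is roughly what the same kind of check looks like with Addressable (it needs the addressable gem installed):

require 'addressable/uri'

uri = Addressable::URI.parse('/some/path/here')
uri.relative?   # => true
uri.absolute?   # => false

Addressable::URI.join('http://www.someurl.com/index.html', '/some/path/here').to_s
# => "http://www.someurl.com/some/path/here"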
I wrote a very simple function to pull data from the ESPN API and display it in default/index. However, default/index is a blank page.
At this point I'm not even trying to parse the JSON; I just want to see something in my browser.
default.py:
import urllib2
import json

# espn_uri is being pulled from models/db.py
def index():
    r = urllib2.Request(espn_uri)
    opener = urllib2.build_opener()
    f = opener.open(r)
    status = json.load(f)
    return dict(status)
default/index.html:
{{status}}
Thank you!
Try: return dict(status=status)
return dict(status) works because status is itself a dict, and dict(status) just copies it. But it probably has no key named status, or at least nothing interesting.
And yes, you need =.
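Put together, the controller with that one change might look like this (everything else exactly as in the question):

import urllib2
import json

def index():
    # espn_uri still comes from models/db.py, as in the question
    r = urllib2.Request(espn_uri)
    f = urllib2.build_opener().open(r)
    status = json.load(f)
    return dict(status=status)   # keyed, so the view can refer to it as status

With {{=status}} (note the equals sign) in default/index.html, the JSON dict should then render instead of a blank page.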
As JLundell advises, first return paired data via the dictionary:
return dict(my_status=status)
Second, as you've worked out, use the following in index.html to access the returned value rather than the local variable. Make sure you use the equals sign here, or nothing will display:
{{=my_status}}
When it comes to JSON, you can return the data using
return my_status.json()
Several other options are available to return data as a list, or to return HTML.
Finally, I recommend that you make use of jQuery and AJAX ($.ajax), so that the AJAX return value can be easily assigned to a JS object. This will also allow you to handle success or errors in the form of JS functions.