I'm trying to scrape search results from google... I'm trying to get the link from this section:
<h3 class="LC20lb"><div class="ellip">About » Commercial Project Services</div></h3><br><div class="TbwUpd"><cite class="iUh30 bc">https://commercialprojectservices.com.au › about</cite></div>
but when I do:
library(tidyverse)
library(rvest)
library(dplyr)
hp <- "https://www.google.com.au/search?q=commercial+project+site%3A.com.au&oq=commercial+&aqs=chrome.0.69i59j69i57j0l4.3707j0j8&sourceid=chrome&ie=UTF-8"
a <- read_html(hp)
b <- html_nodes(a, "div > div > a")
c <- html_attr(b, "href")
b[16] results in:
b[16]
{xml_nodeset (1)}
[1] <a href="/url?q=https://commercialprojectservices.com.au/scaffold-mananagement/quote/&sa=U&ved=2ahUKEwib3_i48OrkAhUS4KYKHSdzDhYQFjAIegQIBRAB&usg=AOvVaw3blXNQKZnL8P1U-ntgVagX"> ...
The href part seems to have been set to the "ping=" part of the line...
Any ideas where I'm going wrong?
Thanks
Google was sending me garbage because it didn't like my user agent (curl).
I faked the user agent and it works now.
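For anyone hitting the same wall, here is a minimal sketch of that fix, assuming httr is available to set the header (the user-agent string below is just an example, not a requirement):
library(httr)
library(rvest)

hp <- "https://www.google.com.au/search?q=commercial+project+site%3A.com.au&oq=commercial+&aqs=chrome.0.69i59j69i57j0l4.3707j0j8&sourceid=chrome&ie=UTF-8"

# send a browser-like user agent instead of curl's default
res <- GET(hp, user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))
a <- read_html(content(res, as = "text", encoding = "UTF-8"))

links <- html_attr(html_nodes(a, "div > div > a"), "href")
head(links)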
I am trying to scrape information from HTML webpages, I have the direct links but cannot for some reason get to the relevant text.
These are two examples of the webpages:
http://151.12.58.148:8080/CPC/CPC.detail.html?A00002
http://151.12.58.148:8080/CPC/CPC.detail.html?A00003
After I read the html, I am left with all the source code aside from the relevant text (which should change from page to page).
For example, the first link gives a page with this:
data di nascita 1872
which is coded, when I inspect it on my browser, as:
<p y:role="datasubset" y:arg="DATA_NASCITA" class="smalltitle">
<span class="celllabel">data di nascita</span>
<span y:role="multivaluedcontent" y:arg="DATA_NASCITA">1872</span>
</p>
however, when I read it with my code:
link <- 'http://151.12.58.148:8080/CPC/CPC.detail.html?A00002'
page <- read_html(link)
write.table(as.character(page), "page.txt")
and I print "page" to check what I am getting, the corresponding part of the code is:
<p y:role=\"datasubset\" y:arg=\"NASCITA\" class=\"smalltitle\">
<span class=\"celllabel\">luogo di nascita</span>
<span y:role=\"multivaluedcontent\" y:arg=\"NASCITA\"></span>
</p>
without 1872, which is the piece of information I am interested in (and the other values are missing as well; not sure if that is indicative of anything).
I can't seem to get around it, would anyone have suggestions?
Thank you very much!
To expand a bit further: the site's HTML loads a bunch of JavaScript and contains a template that is filled in after the document loads; it also uses the query parameter as some type of value that gets computed. I tried to just read in the target JavaScript file and parse it with V8, but there are too many external dependencies.
To read this, you'll need to use something like splashr or seleniumPipes. I'm partial to the former as I wrote it 😎.
Using either requires running an external program. I will not go into how to install Splash or Selenium in this answer; that's legwork you have to do, but splashr makes it pretty easy to use Splash if you are comfortable with Docker.
This bit sets up the necessary packages and starts the Splash server (it will auto-download it first, provided Docker is available on your system):
library(rvest)
library(splashr)
library(purrr)
start_splash()
This next bit tells Splash to fetch & render the page and then retrieves the page content after javascript has done its work:
splash_local %>%
splash_response_body(TRUE) %>%
splash_user_agent(ua_macos_chrome) %>%
splash_go("http://151.12.58.148:8080/CPC/CPC.detail.html?A00002") %>%
splash_wait(2) %>%
splash_html() -> pg
Unfortunately, it's still a mess. The page uses namespaces, which are fine in XML documents but somewhat problematic the way they've been used here. But we can work around that with some clever XPath:
html_nodes(pg, "body") %>%
html_nodes(xpath=".//*[local-name()='h4' or local-name()='p' or local-name()='span']/text()") %>%
html_text(trim=TRUE) %>%
discard(`==`, "")
## [1] "Abachisti Vittorio" "data di nascita" "1872"
## [4] "luogo di nascita" "Mirandola, Modena, Emilia Romagna, Italia" "luogo di residenza"
## [7] "Mirandola, Modena, Emilia Romagna, Italia" "colore politico" "socialista"
## [10] "condizione/mestiere/professione" "falegname" "annotazioni riportate sul fascicolo"
## [13] "radiato" "Unità archivistica" "busta"
## [16] "1" "estremi cronologici" "1905-1942"
## [19] "nel fascicolo è presente" "scheda biografica" "A00002"
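If you only need a specific field, you could also target it directly by its y:arg value instead of pulling all the text. A small sketch, assuming the rendered markup matches the snippet from the question:
# grab just the span whose y:arg attribute equals DATA_NASCITA
html_nodes(pg, xpath = ".//*[local-name()='span' and @*='DATA_NASCITA']") %>%
  html_text(trim = TRUE)
## expected (from the output above): "1872"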
Do this after you're all done with Splash/splashr to remove the running Docker container:
killall_splash()
I'm trying to scrape the Crossfit Games Open leaderboard. I have a version that worked in previous years, but the website changed and I can't seem to update my code to get it to work with the new site.
The issue I have is I can't seem to get the correct CSS selector to get the athletes name and the link to their profile.
My old code does something similar to this:
library(rvest)
# old site
old_url <- "https://games.crossfit.com/scores/leaderboard.php?stage=1&sort=1&page=1&division=1&region=0&numberperpage=100&competition=0&frontpage=0&expanded=0&year=16&scaled=0&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&is_mobile=1"
old_page <- read_html(old_url)
# get the athletes profile url
athlete_link <- html_attr(html_nodes(old_page, "td.name a"), "href")
athlete_name <- html_text(html_nodes(old_page, "td.name a"))
head(athlete_link)
# [1] "http://games.crossfit.com/athlete/124483" "http://games.crossfit.com/athlete/2725" "http://games.crossfit.com/athlete/199938"
# [4] "http://games.crossfit.com/athlete/173837" "http://games.crossfit.com/athlete/2476" "http://games.crossfit.com/athlete/499296"
head(athlete_name)
# [1] "Josh Bridges" "Noah Ohlsen" "Jacob Heppner" "Jonne Koski" "Luke Schafer" "Andrew Kuechler"
# new site
new_url <- "https://games.crossfit.com/leaderboard?page=1&competition=1&year=2017&division=2&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0"
new_page <- read_html(new_url)
# get the athletes profile url
# I would have thought something like this would get it.
# It doesn't seem to pull anything
html_attr(html_nodes(new_page, "td.name a.profile-link"), "href")
# character(0)
html_text(html_nodes(new_page, "td.name div.full-name"))
# character(0)
I've tried various other CSS selectors, SelectorGadget, and a few other things. I'm experienced in R but this is the only real web scraping project I've ever done, so I'm probably missing something very basic.
Which selector should I be using to grab this data?
It looks like the content of this page is generated dynamically with some javascript. You can inspect the source of the page and you'll see something like:
<div class="modal-body">
<!-- dynamically generated content goes here -->
</div>
where the table should go. In these cases, rvest isn't enough.
You can check this recent blog post that has useful pointers: https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/
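As a starting point, here is a rough sketch along the lines of that post, assuming Docker is available so splashr can run Splash; the selectors are the ones from your old code and may need adjusting against the rendered page:
library(rvest)
library(splashr)

start_splash()

new_url <- "https://games.crossfit.com/leaderboard?page=1&competition=1&year=2017&division=2&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0"

splash_local %>%
  splash_response_body(TRUE) %>%
  splash_go(new_url) %>%
  splash_wait(3) %>%   # give the javascript time to build the leaderboard table
  splash_html() -> pg

# once the rendered DOM is available, ordinary rvest selectors apply;
# the class names below are guesses based on the old markup
html_text(html_nodes(pg, "td.name div.full-name"))

killall_splash()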
I am pretty new to webscraping and I am trying to build a scraper that accesses information in the website's source code/html using R.
Specifically, I want to be able to determine whether a (number of) website(s) has an id with a certain text: "google_ads_iframe". The id will always be longer than this, so I think I will have to use a wildcard.
I have tried several options (see below), but so far nothing has worked.
1st method:
doc <- htmlTreeParse("http://www.funda.nl/")
data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE)
Error message reads:
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"
2nd method:
scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)
x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)
x returns as an empty list.
3rd method:
scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)
Again, x is returned as an empty list.
I can't figure out what I am doing wrong, so any help would be really great!
My ad blocker is probably preventing me from seeing google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():
library(xml2)
pg <- read_html("http://www.funda.nl/")
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
One other issue you'll need to deal with is that the google ads iframes are most likely being generated after page-load with javascript, which means using RSelenium to grab the page source (you can then use this method with the resultant page source).
UPDATE
I found a page example with google_ads_iframe in it:
pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE
xml_find_first(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):
library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()
URL <- "http://www.funda.nl/"  # the page from the question
remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]
pg <- read_html(raw_html)
...
# eventually (when done)
phantom_js$stop()
NOTE
The XPath I used with the codepen example (since it has a google ads iframe) was necessary. Here's the snippet where the iframe exists:
<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
<script type="text/javascript">
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
</script>
<iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:"<html><body style='background:transparent'></body></html>"" style="border: 0px; vertical-align: bottom;"></iframe></div>
The iframe tag is a child of the div, so if you target the div first you then have to add the child step in the XPath to reach an attribute inside it.
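Equivalently, you could skip the wrapping div and test for the iframe itself; a small variation on the check above:
# same test, but matching the iframe's id attribute directly
xml_find_lgl(pg, "boolean(.//iframe[contains(@id, 'google_ads_iframe')])")
# should also be TRUE for the codepen example above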
I'm trying to use the read_html function in the rvest package, but have come across a problem I am struggling with.
For example, if I were trying to read in the bottom table that appears on this page, I would use the following code:
library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
By inspecting the HTML code in the browser, I can see that the content I would like is contained in a <table> tag (specifically, it is all contained within <table class="t-calc">). But when I try to extract this using:
tables <- html_nodes(html_content, xpath = '//table')
I retrieve the following:
> tables
{xml_nodeset (4)}
[1] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="FL" class=" "> ...
[2] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="NV" class=" "> ...
[3] <table class="scenarios">\n <tbody/>\n <tr data-id="1">\n <td class="description">El ...
[4] <table class="t-desktop t-polls">\n <thead>\n <tr class="th-row">\n <th class="t ...
This includes some of the table elements on the page, but not the one I am interested in.
Any suggestions on where I am going wrong would be most appreciated!
The table is built dynamically from data in JavaScript variables on the page itself. Either use RSelenium to grab the text of the page after it's rendered and pass the page into rvest OR grab a treasure trove of all the data by using V8:
library(rvest)
library(V8)
URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"
pg <- read_html(URL)
js <- html_nodes(pg, xpath=".//script[contains(., 'race.model')]") %>% html_text()
ctx <- v8()
ctx$eval(JS(js))
race <- ctx$get("race", simplifyVector=FALSE)
str(race) ## output too large to paste here
If they ever change the formatting of the JavaScript (it's an automated process so it's unlikely but you never know) then the RSelenium approach will be better provided they don't change the format of the table structure (again, unlikely, but you never know).
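If you'd rather go the RSelenium route, a minimal sketch (assuming you have phantomjs and a Selenium setup like the one shown earlier in this thread) is to render the page, pull the source, and hand it back to rvest:
library(RSelenium)
library(rvest)

URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"

remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(URL)

# hand the fully rendered page to rvest and look for the t-calc table again
pg <- read_html(remDr$getPageSource()[[1]])
tables <- html_nodes(pg, xpath = "//table")

remDr$close()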
The following is working:
library(RCurl)
library(XML)
url_a <- getURL("https://www.sec.gov/Archives/edgar/data/6494/000119312504029815/d8k.htm")
dfx_a <- htmlParse(url_a)
The following is not working:
url_b <- getURL("http://www.sec.gov/Archives/edgar/data/1639947/000104746916009939/a2226912zs-11.htm")
dfx_b <- htmlParse(url_b)
Why is this?
Maybe you are missing the "s" in "http" in the second URL; it should be https.
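A quick way to check, assuming the redirect to https is the only issue: either switch the scheme or let RCurl follow the redirect.
# use the https URL directly ...
url_b <- getURL("https://www.sec.gov/Archives/edgar/data/1639947/000104746916009939/a2226912zs-11.htm")
dfx_b <- htmlParse(url_b)

# ... or keep the http URL and follow the redirect
url_b <- getURL("http://www.sec.gov/Archives/edgar/data/1639947/000104746916009939/a2226912zs-11.htm",
                followlocation = TRUE)
dfx_b <- htmlParse(url_b)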