Web-Scraping Search Result - html

I am trying to scrape a Google search result, but my element keeps coming back empty. Here is my code:
url = ("https://www.google.com/search?q=most+popular+place+in+france&rlz=1C1CHBF_enUS875US875&sxsrf=AOaemvIURBlj0HQnAmvIlMyjdcP1WPeh5Q%3A1637686399981&ei=fxydYc2IO42jqtsPkZOVuAw&oq=most+popular+place+in+fr&gs_lcp=Cgdnd3Mtd2l6EAMYADIFCAAQgAQyBQgAEIAEMgYIABAWEB4yCQgAEMkDEBYQHjIGCAAQFhAeMgYIABAWEB4yBggAEBYQHjIGCAAQFhAeMgYIABAWEB4yBggAEBYQHjoECCMQJzoFCAAQkQI6CwguEIAEEMcBENEDOggIABCABBCxAzoLCAAQgAQQsQMQgwE6EQguEIAEELEDEIMBEMcBEKMCOgsILhCABBCxAxCDAToECC4QQzoFCC4QgAQ6CwguEIAEEMcBEK8BOgsILhCABBDHARCjAjoECAAQQzoHCAAQyQMQQzoKCAAQgAQQhwIQFDoICAAQyQMQkQI6DQgAEIAEEIcCELEDEBQ6CwgAEIAEELEDEMkDOgUIABCSAzoICAAQgAQQyQNKBAhBGABQAFiBJGDlLGgBcAJ4AIAB8QGIAa8ckgEGMC4yNC4xmAEAoAEBwAEB&sclient=gws-wiz")
library(rvest)
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "///div[1]/div/div/span/div[1]/a/div/div[2]/span[1]") %>%
  html_text()
I have searched "most popular place in france" on Google and am only trying to scrape the "Eiffel Tower" search result. Please help.
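One thing worth noting, as a hedged sketch rather than a verified answer: Google's auto-generated class names and deeply nested divs change often, so a long positional XPath copied from DevTools will frequently match nothing. Targeting the h3 result headings is usually more stable, though Google may serve different markup, or block the request entirely, when the page is fetched from a script rather than a browser.
library(rvest)

# Shorter query URL used here for readability; the full URL from the question should also work
url <- "https://www.google.com/search?q=most+popular+place+in+france"
first_page <- read_html(url)

# Result titles are normally rendered inside <h3> headings in the served HTML
titles <- first_page %>%
  html_nodes("h3") %>%
  html_text()

titles  # the "Eiffel Tower" result should appear here if it is present in the served page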

Related

Web scraping an embedded table using R

I am currently working on a project to scrape the content of the Performance Characteristics table on this website:
https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund
The data I want to extract from this table is the 12m trailing yield of 3.43%.
The code I wrote to do this is:
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundamentalsAndRisk"]/div') %>%
  html_table()
etf_Data <- etf_Data[[1]]
This provided me with an empty list, followed by the error message 'Error in etf_Data[[1]] : subscript out of bounds'.
Using Chrome's Inspect tool I have tried various XPaths, including reading the value with html_text():
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundamentalsAndRisk"]/div/div[4]/span[2]') %>%
  html_text()
etf_Data <- etf_Data[[1]]
However, with no success.
Having gone through other Stack Overflow responses, I have not been able to solve my issue.
Would someone be able to assist?
Thank you
C
A couple of things:
You end up at a different URI in order to get the content you want; it is the page you reach after manually accepting certain conditions on the site.
The data you want is not within a table.
You can add a query string with the siteEntryPassthrough=true parameter to get to the right URI, then use a :contains() selector and an adjacent sibling combinator to pick out the desired value:
library(rvest)
library(magrittr)
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true"
trailing_12m_yield <- url %>%
  read_html() %>%
  html_element('.caption:contains("12m Trailing Yield") + .data') %>%
  html_text2()
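If you then want the figure as a number rather than the text "3.43%", a small hypothetical follow-up step (not part of the original answer):
# Strip the "%" sign and convert to numeric
yield_pct <- as.numeric(gsub("%", "", trailing_12m_yield, fixed = TRUE))
yield_pct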

How to extract the table for each subject from the URL link in R - Webscraping

I'm trying to scrape the table for each subject.
This is the main link: https://htmlaccess.louisville.edu/classSchedule/setupSearchClassSchedule.cfm?error=0
I have to select each subject and click Search, which takes me to the link https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm
Each subject gives a different table. For the subject Accounting, I tried to get the table as below; I used the SelectorGadget Chrome extension to get the node string for html_nodes:
library(rvest)
library(tidyr)
library(dplyr)
library(ggplot2)
url <- "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm"
df <- read_html(url)
tot <- df %>%
  html_nodes('table+ table td') %>%
  html_text()
But it didn't work:
tot
## character(0)
Is there a way to get the tables for each subject with R code?
Your problem is that the site requires a web form to be submitted - that's what happens when you click the "Search" button on the page. Without submitting that form, you won't be able to access the data. This is evident if you attempt to navigate to the link you're trying to scrape - punch that into your favorite web browser and you'll see that there are no tables at all at "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm". No wonder nothing shows up!
Fortunately, you can submit web forms with R. It requires a little bit more code, however. My favorite package for this is httr, which partners nicely with rvest. Here's code that submits the form using httr and then proceeds with the rest of your code:
library(rvest)
library(dplyr)
library(httr)

request_body <- list(
  term = "4212",
  subject = "ACCT",
  catalognbr = "",
  session = "none",
  genEdCat = "none",
  writingReq = "none",
  comBaseCat = "none",
  sustainCat = "none",
  starttimedir = "0",
  starttimehour = "08",
  startTimeMinute = "00",
  endTimeDir = "0",
  endTimeHour = "22",
  endTimeMinute = "00",
  location = "any",
  classstatus = "0",
  Search = "Search"
)

resp <- httr::POST(
  url = paste0("https://htmlaccess.louisville.edu/class",
               "Schedule/searchClassSchedule.cfm"),
  encode = "form",
  body = request_body)

httr::status_code(resp)

df <- httr::content(resp)

tot <- df %>%
  html_nodes("table+ table td") %>%
  html_text() %>%
  matrix(ncol = 17, byrow = TRUE)
On my machine, that returns a nicely formatted matrix with the expected data. Now, the challenge was figuring out what the heck to put in the request body. For this, I use Chrome's "Inspect" tool (right-click on a webpage, hit "Inspect"). On the "Network" tab of that side panel, you can track what information is being sent by your browser. If I start on the main page and keep that tab open while I "search" for Accounting, I see that the top hit is "searchClassSchedule.cfm"; clicking on it opens it up. There you can see all the form fields that were submitted to the server, and I simply copied those over into R manually.
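As a possible shortcut to copying the fields by hand, rvest can also read the form straight off the search setup page with html_form(); a sketch, assuming the page exposes a standard HTML form:
library(rvest)

# Read the search setup page and list its forms, field names, and default values
search_page <- read_html(paste0("https://htmlaccess.louisville.edu/class",
                                "Schedule/setupSearchClassSchedule.cfm?error=0"))
html_form(search_page)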
Your job will be to figure out what shortened names the rest of the departments use! "ACCT" seems to be the one for "Accounting". Once you've got those names in a vector, you can loop over them with a for loop or an lapply call:
dept_abbrevs <- c("ACCT", "AIRS")
lapply(dept_abbrevs, function(abbrev){
  # ...code from above...
  # ...after defining request_body...
  request_body$subject <- abbrev
  # ...rest of the code...
})
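Spelled out, the loop might look something like this sketch, reusing request_body, the URL, and the parsing pipeline from above:
dept_abbrevs <- c("ACCT", "AIRS")

all_tables <- lapply(dept_abbrevs, function(abbrev) {
  body <- request_body
  body$subject <- abbrev   # swap in the department code
  resp <- httr::POST(
    url = paste0("https://htmlaccess.louisville.edu/class",
                 "Schedule/searchClassSchedule.cfm"),
    encode = "form",
    body = body)
  # Parse the response the same way as the single-subject example
  # (assumes every subject's table has the same 17 columns)
  httr::content(resp) %>%
    html_nodes("table+ table td") %>%
    html_text() %>%
    matrix(ncol = 17, byrow = TRUE)
})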

Nothing Returning When Using rvest and xpath on an html page

I'm using XPath and rvest to scrape an .htm page. Other rvest examples work well with pipelines, but for this particular script nothing is returned.
webpage <- read_html("https://www.sec.gov/litigation/admin/34-45135.htm")
whomst <- webpage %>% html_nodes(xpath = '/html/body/table[2]/tbody/tr/td[3]/font/p[1]/table/tbody/tr/td[1]/p[2]')
What is returned is:
{xml_nodeset (0)}
Here's the page that I'm on: https://www.sec.gov/litigation/admin/34-45135.htm. I'm trying to extract the words "PINNACLE HOLDINGS, INC."
Chrome's DevTools doesn't always give you an XPath or CSS selector that matches the source HTML. In particular, browsers insert <tbody> elements into tables that don't have them in the source, so an XPath copied from the inspector that includes tbody can match nothing in the document rvest actually downloads. Sometimes you need to experiment yourself; this selector works:
webpage %>% html_nodes("td > p:nth-child(3)") %>% html_text()
result:
[1] "PINNACLE HOLDINGS, INC., \n

Scraping URLs from Wikipedia in R yielding only half the URL

I am currently trying to extract the URLs from the Wikipedia page with a list of chief executive officers; the code then opens the URLs and copies the text into .txt files for me to use. The trouble is that the allurls object only contains the latter half of each URL. For example, allurls[1] gives "/wiki/Pierre_Nanterme". Thus, when I run this code:
library("xml2")
library("rvest")
url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"
allurls <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href") %>%
  .[!duplicated(.)] %>%
  lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
      ., c(paste("tmp", 1:length(.))))
allurls[1]
I get the following error:
Error: '/wiki/Pierre_Nanterme' does not exist.
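The hrefs are relative ("/wiki/Pierre_Nanterme"), so read_html() has no complete URL to fetch; they need to be resolved against the Wikipedia host before being opened. A sketch of that first step using xml2::url_absolute(), keeping the question's selector and leaving the download/write step aside:
library(xml2)
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"

allurls <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href") %>%
  .[!duplicated(.)] %>%
  url_absolute(base = url)   # "/wiki/Pierre_Nanterme" becomes "https://en.wikipedia.org/wiki/Pierre_Nanterme"

allurls[1]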

Get span content using rvest

I'm trying to scrape a set of web pages with the rvest package. It works for getting the content of the web pages, but I can't get the creation time of the first floor (the first post), which is 2017-08-17 01:47 for this web page.
url <- read_html("http://tieba.baidu.com/p/5275787419", encoding = "UTF-8")
# This works
contents <- url %>% html_nodes(".d_post_content_firstfloor .clearfix") %>% html_text()
# This doesn't work
create_time <- url %>% html_nodes(".d_post_content_firstfloor li+ li span") %>% html_text()
create_time
character(0)
I want to get the time of the first floor on the page, but I don't know how to access it.
One way to achieve this could be
create_time <- url %>%
  html_nodes(xpath = '//*[@id="j_p_postlist"]/div[1]') %>%
  xml_attr("data-field")
gsub(".*date\\\":\\\"(.*)\\\",\\\"vote_crypt.*", "\\1", create_time)
Output is:
[1] "2017-08-17 01:47"
Hope this helps!
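An alternative sketch: data-field holds a JSON blob, so parsing it with jsonlite and pulling out the date entry can be less brittle than a regex. This assumes the attribute decodes to valid JSON, which the gsub pattern above suggests it does:
library(jsonlite)

field <- url %>%
  html_nodes(xpath = '//*[@id="j_p_postlist"]/div[1]') %>%
  html_attr("data-field")

# Inspect the nested list to locate the "date" value, then extract it
parsed <- fromJSON(field)
str(parsed)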