I am trying to use rvest to scrape one page of Google Scholar search results into a dataframe of author, paper title, year, and journal title.
The simplified, reproducible example below is code that searches Google Scholar for the example terms "apex predator conservation".
Note: to stay within the Terms of Service, I only want to process the first page of search results that I would get from a manual search. I am not asking about automation to scrape additional pages.
The following code already works to extract:
author
paper title
year
but it does not have:
journal title
I would like to extract the journal title and add it to the output.
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
df
source: https://stackoverflow.com/a/58192323/8742237
So the output of that code looks like this:
#> titles
#> 1 [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2 Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3 Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark
#> authors years
#> 1 A Ordiz, R Bischof, JE Swenson 2013
#> 2 A Barnett, KG Abrantes, JD Stevens, JM Semmens 2011
Two questions:
How can I add a column that has the journal title extracted from the raw data?
Is there a reference where I can read and learn more about how to work out how to extract other fields for myself, so I don't have to ask here?
One way to add them is this:
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
library(purrr)     # map_chr() and map() used below
library(magrittr)  # %>% pipe used below
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Strip the author and year strings, leaving the journal/publisher text
leftovers <- authors_years %>%
  str_remove_all(authors) %>%
  str_remove_all(years)
# Take the text after the first "-", keep only the alphabetic words, and
# collapse them back into a journal title
journals <- str_split(leftovers, "-") %>%
  map_chr(2) %>%
  str_extract_all("[:alpha:]*") %>%
  map(function(x) x[x != ""]) %>%
  map(~paste(., collapse = " ")) %>%
  unlist()
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
For your second question: the SelectorGadget Chrome extension is handy for finding the CSS selectors of the elements you want. In your case, though, the authors, journal, and year all live in the same element (the .gs_a line), so the only way to disentangle them is with regular expressions. So I'd suggest learning a bit about CSS selectors and regex :)
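If you prefer to avoid the string surgery above, here is a rough alternative sketch (my own addition, not from the original answer): parse each .gs_a string in one pass with a single regex. It assumes the usual "authors - journal, year - publisher" layout, which Google Scholar does not guarantee for every result, so some rows may come back NA.
library(stringr)
# authors_years and titles come from the code above
parts <- str_match(authors_years, "^(.*?)\\s*-\\s*(.*?),\\s*(\\d{4})")
df2 <- data.frame(
  titles   = titles,
  authors  = parts[, 2],
  journals = parts[, 3],
  years    = parts[, 4],
  stringsAsFactors = FALSE
)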
Related
I need to scrape this webpage so I can build a data.frame like this:
value01 value02 id
SECTION I LIVE ANIMALS ANIMAL PRODUCTS sectionI
CHAPTER 1 LIVE ANIMALS chap0100000000
0101 Live horses, asses, mules and hinnies : (TN701) 0101000000-1
- Horses : 0101210000-2
0101 21 - - Pure-bred breeding animals (NC018) 0101210000-80
0101 29 - - Other : 0101290000-3
0101 29 10 - - - For slaughter 0101291000-80
0101 29 90 - - - Other 0101299000-80
0101 30 - Asses 0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the rest of the values of value01 and value02 I use the following (I need to clean the values after obtaining them with this code, but I think there is a better way to get the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code you need to run for the previous examples to work:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')
It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table , .chapter') |>
rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1])
You'll need to clean the scraped data to your liking.
If this is not what you are looking for, you should clarify your question further.
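If it helps, here is a rough sketch of how you might put those pieces together into the data.frame you described (the value01/value02 indexing is an assumption on my part; adjust it to the real structure of the per-line tables):
# Assumes each per-line table returned by html_table() has at least two
# columns and one row; the column names below are my own choice.
out <- data.frame(
  value01 = sapply(data, function(tb) as.character(tb[[1]][1])),
  value02 = sapply(data, function(tb) as.character(tb[[2]][1])),
  id      = unname(ids),
  stringsAsFactors = FALSE
)
head(out)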
I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with HTML and CSS, so if you have any useful resource or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
We can convert the text obtained from the Daily situation update into a tibble using the vroom package:
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is,
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns the values for all matching nodes, whereas html_node only returns the first node. Thus, the former fetches all the nodes, which is what you need here.
html_text vs html_text2
html_text2 retains the line breaks in the strings (the \n characters), which is helpful when working with the text afterwards.
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
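A tiny illustration of the difference on this page (just a sketch, reusing the same URL as above):
library(rvest)
node <- read_html('https://covid19.public.lu/en.html') %>%
  html_element('.cmp-gridStat__item-container')
html_text(node)   # raw text as it sits in the HTML source
html_text2(node)  # browser-like text: <br> becomes \n, whitespace is tidied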
There is probably a much more elegant way to do this efficiently, but when I need to brute-force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and a positive lookahead regex to get the exact piece of data I need. It basically takes the form of "(?<=text_right_before).+?(?=text_right_after)"
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
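If useful, a small follow-up sketch (the column names are my own choice, not from the page) that collects those figures into a one-row data.frame:
# normal_care and intensive_care come from the snippet above; the values are
# plain integers on the page, so as.numeric() should be enough.
data.frame(
  normal_care    = as.numeric(normal_care),
  intensive_care = as.numeric(intensive_care)
)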
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to retrieve it.
If you want it as a DataFrame (resembling the one on the webpage) you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. It is a bit more effort, but as you already have other answers covering rvest I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without needing to parse the pdf publication I use; for example, there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few calculations at the bottom of the webpage that are not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
  data <- pdf_text(filename)
  text <- data[[1]] %>% strsplit(split = "\n{2,}")
  web_data <- text[[1]][3:12]
  # Split each block on runs of whitespace and reshape into a 10 x 5 table
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()
  # Column names come from the header line of the pdf table
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)
  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }
  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
                      Metric = clean_col2(Metric))
  return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull the items and then process them, you could extract each column as an item in a list. First replace the br elements so that the content around them ends up in a comma-separated list:
library(rvest)
library(magrittr)
library(stringi)
library(xml2)
library(purrr)    # for map() used below
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
I collected a series of URLs similar to this one. For each URL, I am using the rvest package to web-scrape information related to the address of every practitioner listed in each box of the webpage. By inspecting the HTML structure of the webpage, I noticed that the information I am trying to retrieve is inside the HTML division with class unit size1of2 (which appears, when hovering with the cursor, as div.unit.size1of2). I then used the following code to extract the information I need:
library(rvest)
library(xml2)
webpage <- read_html(x = "myURL")
webpage_name <- webpage %>%
html_nodes("div.unit.size1of2") %>%
html_text(trim = T)
However, when I extract the information, the result I get is super messy. First of all, there is information I didn't want to scrape, some of which doesn't even seem to be present on the website. In addition, my RStudio IDE freezes for a while every time I try to print the result, and afterwards it doesn't work properly with any command. Finally, the result is not the one I was looking for.
Do you think this is due to some kind of protection present on the website?
Thank you for your help!
You can start by iterating on the rows, which can be selected using div.search-result .line, and then:
getting the name using div:first-child h3
getting the ordinal using div:first-child p
getting the location by iterating on div:nth-child(2) p, since there can be multiple locations (one practitioner has 5 locations on your page), and storing them in a list
It's necessary to remove the tabs and newlines using gsub("[\t\n]", "", x) for the name and ordinal. For the addresses, you can get the text, split on the newline character \n, remove duplicate newlines and strip the first and last \n to get a list like:
[1] "CABINET VÉTÉRINAIRE DV FEYS JEAN-MARC"
[2] "Cabinet Veterinaire"
[3] "ZA de Kercadiou"
[4] "XXXXX"
[5] "LANVOLLON"
[6] "Tél : 0X.XX.XX.XX.XX"
The following code also converts the list of vectors to a dataframe with all the data on that page:
library(rvest)
library(plyr)
url = "https://www.veterinaire.fr/annuaires/trouver-un-veterinaire-pour-soigner-mon-animal.html?tx_siteveterinaire_general%5B__referrer%5D%5B%40extension%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40vendor%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40controller%5D=FrontendUser&tx_siteveterinaire_general%5B__referrer%5D%5B%40action%5D=search&tx_siteveterinaire_general%5B__referrer%5D%5Barguments%5D=YToxOntzOjY6InNlYXJjaCI7YTo1OntzOjM6Im5vbSI7czowOiIiO3M6NjoicmVnaW9uIjtzOjA6IiI7czoxMToiZGVwYXJ0ZW1lbnQiO3M6MDoiIjtzOjU6InZpbGxlIjtzOjA6IiI7czoxMjoiaXRlbXNQZXJQYWdlIjtzOjI6IjEwIjt9fQ%3D%3D21a1899f9a133814dfc1eb4e01b3b47913bd9925&tx_siteveterinaire_general%5B__referrer%5D%5B%40request%5D=a%3A4%3A%7Bs%3A10%3A%22%40extension%22%3Bs%3A15%3A%22SiteVeterinaire%22%3Bs%3A11%3A%22%40controller%22%3Bs%3A12%3A%22FrontendUser%22%3Bs%3A7%3A%22%40action%22%3Bs%3A6%3A%22search%22%3Bs%3A7%3A%22%40vendor%22%3Bs%3A15%3A%22SiteVeterinaire%22%3B%7D7cd75ca141359a98763248c24da8103293a53d08&tx_siteveterinaire_general%5B__trustedProperties%5D=a%3A1%3A%7Bs%3A6%3A%22search%22%3Ba%3A5%3A%7Bs%3A3%3A%22nom%22%3Bi%3A1%3Bs%3A6%3A%22region%22%3Bi%3A1%3Bs%3A11%3A%22departement%22%3Bi%3A1%3Bs%3A5%3A%22ville%22%3Bi%3A1%3Bs%3A12%3A%22itemsPerPage%22%3Bi%3A1%3B%7D%7D86c9510d17c093c44d053714ab20567929a45f9d&tx_siteveterinaire_general%5Bsearch%5D%5Bnom%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bregion%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bdepartement%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bville%5D=&tx_siteveterinaire_general%5Bsearch%5D%5BitemsPerPage%5D=100&tx_siteveterinaire_general%5B%40widget_0%5D%5BcurrentPage%5D=127&cHash=8d8dc78e004b4b9d0ecfdf9b884f54ca"
rows <- read_html(url) %>%
html_nodes("div.search-result .line")
strip <- function (x) gsub("[\t\n]", "", x)
i <- 1
data = list()
for (r in rows) {
  addresses = list()
  j <- 1
  locations = r %>% html_nodes("div:nth-child(2) p")
  for (loc in locations) {
    addresses[[j]] <- loc %>% html_text() %>%
      gsub("[\t]", "", .) %>%            # remove tabs
      gsub('([\n])\\1+', '\\1', .) %>%   # remove duplicate \n
      gsub('^\n|\n$', '', .) %>%         # remove 1st and last \n
      strsplit(., split = '\n', fixed = TRUE)  # split by \n
    j <- j + 1
  }
  data[[i]] <- c(
    name = r %>% html_nodes("div:first-child h3") %>% html_text() %>% strip(.),
    ordinal = r %>% html_nodes("div:first-child p") %>% html_text() %>% strip(.),
    addresses = addresses
  )
  i <- i + 1
}
df = rbind.fill(lapply(data,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
#show data
print(df)
for (i in 1:3) {
  print(paste("name", df[i, "name"]))
  print(paste("ordinal", df[i, "ordinal"]))
  print(paste("addresses", df[i, "addresses"]))
  print(paste("addresses1", df[i, "addresses1"]))
  print(paste("addresses2", df[i, "addresses2"]))
  print(paste("addresses3", df[i, "addresses3"]))
}
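For what it's worth, here is a more compact sketch of the same idea using purrr and tibble (my own variation, not part of the answer above), which keeps the addresses as a list-column instead of separate addresses1/addresses2/... columns:
library(rvest)
library(purrr)
library(tibble)
# url is the same search URL defined above
rows <- read_html(url) %>% html_nodes("div.search-result .line")
df2 <- map_dfr(rows, function(r) {
  tibble(
    name      = r %>% html_node("div:first-child h3") %>% html_text(trim = TRUE),
    ordinal   = r %>% html_node("div:first-child p") %>% html_text(trim = TRUE),
    addresses = list(r %>% html_nodes("div:nth-child(2) p") %>% html_text(trim = TRUE))
  )
})
df2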
I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all the legislative documents to its website. I've been crawling over all pages starting from 2008. Right now I'm parsing the information into a dataframe and I'm having an issue that I have not been able to resolve so far.
If we look at the DOM we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
library(RCurl)  # getURL()
library(rvest)  # read_html(), html_nodes(), html_text(), and %>%
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" elements are related to the text. I was able to get the text of the webpage with the following code. The code also identifies the position of the first "regex hit" for each of the patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
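If you would rather stay with rvest and skip the PDF round-trip, here is a minimal alternative sketch (my own suggestion, not tested against your full corpus): grab the page text once and rely on stringr::str_extract(), which already returns only the first match per pattern.
library(rvest)
library(stringr)
page_text <- read_html("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm") %>%
  html_text2()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
sapply(tomatch, function(p) str_extract(page_text, p))  # first hit per pattern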
Here is another approach :
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
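If you go this route, you will probably also want to close the automated Internet Explorer window when you are done; calling the standard COM Quit method should do it (a small assumption on my part that you want this cleanup):
# Close the automated Internet Explorer instance
IEApp$Quit()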
I'm currently trying to download a table from the following URL:
url1 <- "http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=de"
I downloaded and saved the file as .xls because I thought it was an Excel file, using the following code:
temp <- paste0(tempfile(), ".xls")
download.file(url1, destfile = temp, mode = "wb")
First I tried to read it into R as an Excel file, but it seems to be HTML (it can be read by Excel though):
dfAMS <- read_excel(path = temp, sheet = "Sheet1", range = "I7:I37")
Therefore:
df <- read_html(temp)
Now unfortunately I'm stuck because the following lines of code won't give me the intended result (a nice table, or at least column I7:I37 of the .xls):
dfAMS <- html_node(df, "table") %>% html_table(fill = T) %>% tibble::as_tibble()
dplyr::glimpse(df)
I'm pretty sure the solution is rather simple but I'm currently stuck and can't find a solution...
Thanks in advance!
Klamsi, the URL points to an HTML file renamed to have a ".xls" extension. This is somewhat common practice among webmasters. Try it yourself by renaming the ".xls" extension to ".html".
A second problem is that the html has a very messy table configuration. The table of interest is the fifth table in the document.
This is a workaround to obtain the values for the overall population (or "range A7:B37, I7:K37")
url <- "http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193®ionDisplay=%C3%96sterreich&export=html&outputLocale=en"
df <- read_html(url) %>%
html_table(header = TRUE, fill = TRUE) %>%
.[[5]] %>% #Extract the fifth table in the list
as.data.frame() %>%
.[,c(1:11)] %>%
select(1:2, 9:11)
names <- unlist(df[1,])
names[1:2] <- c("item", "Bundesland")
colnames(df) <- names
df <- df[-1,]
df %>% head()
item Bundesland Bestand Veränderung zum VJ absolut Veränderung zum VJ in %
2 Arbeitslosigkeit Bgld 7119 -973 -0.120242214532872
3 Arbeitslosigkeit Ktn 16564 -2160 -0.115359965819269
4 Arbeitslosigkeit NÖ 46342 -6095 -0.116234719758949
5 Arbeitslosigkeit OÖ 29762 -4649 -0.135102147569091
6 Arbeitslosigkeit Sbg 11173 -643 -0.0544177386594448
7 Arbeitslosigkeit Stmk 28677 -5602 -0.1634236704688
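As a possible next step (my own suggestion, not part of the answer above): after the header shuffling the columns are most likely still character, so you may want to convert the numeric ones.
library(dplyr)
# Assumes the values print as in the output above (plain numbers with a
# decimal point); adjust the parsing if your export uses commas instead.
df <- df %>%
  mutate(across(-c(item, Bundesland), as.numeric))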