Turning a table in HTML into a data frame - html

I'm trying my hand at scraping tables from Wikipedia and I'm reaching an impasse. I'm using the squads of the FIFA 2014 World Cup as an example. In this case, I want to extract the list of the participating countries from the table of the contents from the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
html_node(xpath = '//*[@id="toc"]/ul') %>%
htmlTreeParse() %>%
xmlRoot())
This spits out a bunch of HTML code that I won't copy/paste here. I specifically am looking to extract all lines with the tag <span class="toctext"> such as "Group A", "Brazil", "Cameroon", etc. and have them saved as a vector. What function would make this happen?

You can read the text from a node using html_text():
url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
read_html() %>%
html_node(xpath = '//*[@id="toc"]') %>%
html_text()
This gives you a single string (a character vector of length one). You can then split it on the \n character to get the entries as a vector, and drop the blanks:
contents <- strsplit(toc, "\n")[[1]]
contents[contents != ""]
# [1] "Contents" "1 Group A" "1.1 Brazil"
# [4] "1.2 Cameroon" "1.3 Croatia" "1.4 Mexico"
# [7] "2 Group B" "2.1 Australia" "2.2 Chile"
# [10] "2.3 Netherlands" "2.4 Spain" "3 Group C"
# [13] "3.1 Colombia" "3.2 Greece" "3.3 Ivory Coast"
# [16] "3.4 Japan" "4 Group D" "4.1 Costa Rica"
# [19] "4.2 England" "4.3 Italy" "4.4 Uruguay"
# ---
# etc
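If only the country names are wanted, the numbered entries can be filtered one step further. A minimal sketch, assuming the entries keep the "1.1 Brazil" numbering shown above; the `contents` vector here is a hand-copied sample rather than a live scrape, and `is_country` and `countries` are names introduced for illustration:

```r
# Hand-copied sample of the split table of contents (assumption: this format holds)
contents <- c("Contents", "1 Group A", "1.1 Brazil", "1.2 Cameroon",
              "2 Group B", "2.1 Australia", "3.3 Ivory Coast")

# Second-level entries ("1.1", "2.3", ...) are the countries
is_country <- grepl("^[0-9]+\\.[0-9]+ ", contents)

# Drop the leading section number and keep only the name
countries <- sub("^[0-9]+\\.[0-9]+ +", "", contents[is_country])
countries
```

Group headings such as "1 Group A" have no dot in their number, so the pattern skips them.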
Generally, to read tables in an HTML document you can use the html_table() function, but in this case the table of contents isn't picked up (it appears to be built from nested lists rather than a <table> element):
url %>%
read_html() %>%
html_table()
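For comparison, here is a small offline sketch of what html_table() does when the page really contains a <table> element. The markup passed to minimal_html() is made up for illustration and is not taken from the Wikipedia page:

```r
library(rvest)

# Made-up markup with a genuine <table> element (assumption, for illustration only)
page <- minimal_html('
  <table>
    <tr><th>Country</th><th>Group</th></tr>
    <tr><td>Brazil</td><td>A</td></tr>
    <tr><td>Spain</td><td>B</td></tr>
  </table>')

# On a whole document, html_table() returns a list with one data frame per table
tables <- html_table(page)
squads <- tables[[1]]
squads
```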

Related

Reading HTML into an R data frame using rvest

I am trying to scrape data from https://homicides.news.baltimoresun.com/recent/ using rvest and put information on victims into a data table or frame.
What I have so far is:
html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")
html_node(html, ".recentvictims") %>%
html_children() %>%
head() %>%
html_text2()
which gives me a list of the information, but I can't find a way to put this into a data frame.
[1] "Date & time\nVictim name\nAddress\nAge\nGender\nRace"
[2] "09/26/2022 7:15 p.m.\n\n1900 Griffis Ave\n—\nMale\nUnknown"
[3] "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack"
[4] "09/20/2022 9:00 a.m.\nDelon Bushrod\n2800 Bookert Dr\n24\nMale\nBlack"
[5] "09/19/2022 8:06 p.m.\nTerry Gordon\n1600 N Wolfe St\n53\nMale\nBlack"
[6] "09/16/2022 9:43 a.m.\nDelanie McCloud\n100 Wilmott Court\n37\nMale\nBlack"
I've also tried selecting the HTML elements under ".recentvictims"
minimal_html(html) %>%
html_element(".recentvictims")
which gives me:
[1] <div class="lfrow">\n <div class="lfdate">\n <b>Date & time\n </div>\n ...
[2] <div class="lfrow odd">\n <div class="lfdate">\n <a href="/victim/4597/">\n ...
[3] <div class="lfrow even">\n <div class="lfdate">\n <a href="/victim/4595/">\n ...
I want to grab all the info under classes "lfrow even" and "lfrow odd"
Any suggestions? Thank you
To get your output into a data frame, I added as.data.frame() to your first piece of code. That creates a data frame with a single column named ".", in which each row holds one record with its fields separated by line breaks (\n). I then used the tidyr function separate() to split that column into separate columns. To get the column names, I used strsplit() to split the first row into a character vector. (strsplit() returns a list, so [[1]] extracts its first element, which is the required vector of column names.)
library(rvest)
library(tidyr)
library(dplyr)
html <- read_html(x = "https://homicides.news.baltimoresun.com/recent/")
data <- html_node(html, ".recentvictims") %>%
html_children() %>%
head() %>%
html_text2() %>%
as.data.frame
want <- data %>%
filter(row_number()>1) %>% # first row has column names
separate(col='.',sep="\\n",into=strsplit(data[1,1],'\\n')[[1]])
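The same split can also be sketched in base R without tidyr. This is only an illustration, assuming every record has the same number of \n-separated fields; the `rows` vector is copied from the output in the question, and `fields` and `victims` are names introduced here:

```r
# Sample rows copied from the question's html_text2() output
rows <- c(
  "Date & time\nVictim name\nAddress\nAge\nGender\nRace",
  "09/21/2022 1:45 p.m.\nKelly Logan\n2100 Kloman St\n53\nFemale\nBlack",
  "09/20/2022 9:00 a.m.\nDelon Bushrod\n2800 Bookert Dr\n24\nMale\nBlack"
)

# Split every row on the line break; the result is a list of character vectors
fields <- strsplit(rows, "\n", fixed = TRUE)

# Stack the data rows into a matrix, then use the first row as column names
victims <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(victims) <- fields[[1]]
victims
```

All columns come back as character, so Age would still need as.numeric() before any arithmetic.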

Scrape object from html with rvest

I am new in web scraping with r and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table in the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with HTML and CSS, so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes( source , '.cmp-gridStat__item-container' ) %>% html_node( '.number' ) %>% html_text() %>% toString()
We can convert the text obtained from the Daily situation update using the vroom package:
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns the values of all matching nodes, whereas html_node returns only the first match. The former is therefore the one to use when you want every node.
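The difference is easy to check offline. A minimal sketch, using made-up markup passed to minimal_html() rather than the live page:

```r
library(rvest)

# Two made-up elements sharing one class (assumption, for illustration only)
page <- minimal_html('<div class="item">first</div><div class="item">second</div>')

# html_elements() returns every match
all_items <- page %>% html_elements(".item") %>% html_text2()

# html_element() returns only the first match
first_item <- page %>% html_element(".item") %>% html_text2()
```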
html_text vs html_text2
html_text2 keeps the line breaks in the text, converting <br> tags to \n. This is helpful when you need to split the strings afterwards.
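A small offline sketch of that behaviour, assuming a paragraph that uses <br> the way the hospitalization figures seem to; the markup is made up for illustration:

```r
library(rvest)

# Made-up paragraph with a <br> between two values
node <- minimal_html("<p>Normal care: 57<br>Intensive care: 23</p>") %>%
  html_element("p")

raw_text    <- html_text(node)   # the <br> is dropped, so the two parts run together
with_breaks <- html_text2(node)  # the <br> becomes a line break

# Having the break makes it possible to split into one entry per line
parts <- strsplit(with_breaks, "\n", fixed = TRUE)[[1]]
```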
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
There is probably a much more elegant and efficient way to do this, but when I need to brute-force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and a positive lookahead regex to get the exact piece of data I need. It basically takes the form of "(?<=text_right_before).+?(?=text_right_after)"
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html<-content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
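The same lookbehind/lookahead idea can be tried offline in base R. A minimal sketch: the HTML fragment is an assumption about what the page source looks like, and `extract_between` is a hypothetical helper, not part of httr or stringr:

```r
# Assumed fragment of the page source (not fetched from the site)
html <- "<p>Normal care: 57<br>Intensive care: 23</p>"

# Hypothetical helper: return the shortest text between two literal markers.
# The markers are pasted into the pattern as-is, so they must not contain
# regex metacharacters.
extract_between <- function(text, before, after) {
  pattern <- paste0("(?<=", before, ").+?(?=", after, ")")
  m <- regexpr(pattern, text, perl = TRUE)  # lookarounds need perl = TRUE
  regmatches(text, m)
}

normal_care    <- extract_between(html, "Normal care: ", "<br>")
intensive_care <- extract_between(html, "Intensive care: ", "</p>")
```

Both values come back as character strings and would need as.numeric() before use in calculations.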
I wondered whether you could get the same data from one of their public APIs. If you simply want a PDF with that table (plus lots of other tables of useful info), you can use the API to retrieve it.
If you want it as a data frame (resembling the webpage), you can write a user-defined function, with the help of pdftools, to reconstruct the table from the PDF. It takes a bit more effort, but as you already have other answers covering rvest, I thought I'd have a look at this. I looked at tabularize but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without needing to parse the PDF publication I use; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
data <- pdf_text(filename)
text <- data[[1]] %>% strsplit(split = "\n{2,}")
web_data <- text[[1]][3:12]
df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
unlist() %>%
matrix(nrow = 10, ncol = 5, byrow = T) %>%
as_tibble()
colnames(df) <- text[[1]][2] %>%
strsplit(split = "\\s{2,}") %>%
map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
unlist()
title <- text[[1]][1] %>%
strsplit(split = "\n") %>%
unlist() %>%
tail(1) %>%
gsub("\\s+", " ", .) %>%
gsub(" TOTAL", "", .)
colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
colnames(df)[1] <- "Metric"
clean_col <- function(x) {
gsub("\\s+|,", "", x) %>% as.numeric()
}
clean_col2 <- function(x) {
gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
}
df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
Metric = clean_col2(Metric)
)
return(df)
}
View(get_latest_daily_df(filename))
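The core trick in the function above is splitting each line of pdf_text() output on runs of two or more whitespace characters. A minimal sketch of just that step, using a hypothetical line (the layout is an assumption, not copied from the report):

```r
# Hypothetical line in the style of pdf_text() output:
# columns are separated by runs of spaces, numbers use a space as thousands separator
line <- "Example metric      4 625      960 315"

# Two or more whitespace characters mark a column boundary
fields <- strsplit(line, "\\s{2,}")[[1]]

# Strip the thousands separator before converting to numeric
values <- as.numeric(gsub("\\s+", "", fields[-1]))
```

A single space inside "Example metric" or "4 625" is not treated as a boundary, which is why the split asks for at least two.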
Alternate:
If you simply want to pull items then process you could extract each column as an item in a list. Replace br elements such that the content within those end up in a comma separated list:
library(rvest)
library(magrittr)
library(purrr) # provides map()
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
# Add a "," after every <br>, then remove the <br> itself
# (method from https://stackoverflow.com/a/46755666 by hrbrmstr)
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",")
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
# One list item per column, holding the text of each paragraph in it
map(columns, ~ .x %>%
html_elements("p") %>%
html_text(trim = T) %>%
gsub("\n\\s{2,}", " ", .) %>%
stri_remove_empty())

Tabulizing data off website PDFs (w/ various formats) assigning each event values according to HTML link titles

I've been trying to automate the process of manually typing up the data from the ATF's trace data site (see "URL"), but it's been a fairly big pain: so far I've only been able to collect each URL that links to a PDF and assign it to its correct State/Territory and Year. The newer files (2017-2019) have data tables that are relatively easy to pull data from compared to 2014-2016; e.g. compare page 10 of the Trace Data report 2019 with that of the Trace Data report 2014.
It's the latter that I'm having the most trouble with, as the data is not stored in anything that looks like a table but is instead arranged around a pie chart(!). There are some promising R packages such as "pdftools" and "tesseract", but I'm very much an amateur when it comes to troubleshooting advanced analytical packages such as these.
It's my guess that I'm still a ways off from where I want to be with the final product as I would need to mine the bottom text of page 10 to find how many "other" weapons were traced to a city, as well the number of weapons where a recovery city couldn't be determined. But if anyone has any suggestions on what I could try next or to even make the working code more efficient, I'd appreciate it.
URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
x <- c('\t\t\t<div>([^<]*)</div>','\t\t</tr><tr><td>([^<]*)</td>','\t\t\t<td>([^<]*)</td>')
r <- read_html(URL) %>% html_nodes("a") %>% map_df(~{
Link <- .x %>% html_attr("href")
Title <- .x %>% html_text()
data_frame(Link, Title)
}) %>%
dplyr::filter(grepl('node',Link, fixed = T))
r <- as.data.frame(r)
x <- c('<ul><li>([^<]*)</li>','\t<li>([^<]*)</li>')
states <- c('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Guam & Northern Mariana Islands','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Puerto Rico','Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming')
s <- list()
for(i in 1:nrow(r)){
s[[i]] <- read_html(r$Link[i]) %>% html_nodes("a") %>% map_df(~{
Link <- .x %>% html_attr("href")
Title <- .x %>% html_text()
data_frame(Link, Title)
}) %>% mutate(Year = r$Title[i]) %>% # "=" names the column directly; "<-" produced an oddly named column
dplyr::filter(Title %in% states | str_detect(Title, "Virgin Islands")) %>%
dplyr::filter(grepl('download',Link, fixed = T))
}
trace_list <- do.call(rbind, s) # bind once, after the loop has finished
Progress so far...
library(pdftools)
pdf_file <- "https://www.atf.gov/file/146951/download"
text <- pdf_text(pdf_file)
cat(text[10])
vtext <- as.list(str_split(text[10],"\n"))
x <- data.frame(matrix(unlist(vtext), nrow=length(vtext), byrow=TRUE),stringsAsFactors=FALSE)
x1 <- pivot_longer(x, cols = 1:length(x),names_to="X1",values_to="X2")
x1$X2 <- trimws(x1$X2)
x1 <- x1[c(8,12),]
x1[1,2] <- sub(" ","_",x1[1,2],fixed=T)
library(splitstackshape)
x1 <- as.data.frame(cSplit(x1, 'X2', sep=" ", type.convert=FALSE))
x1 <- x1[,c(2:length(x1))]
colnames(x1) <- x1[1,]
x1 <- x1[-1, ]
x2 <- pivot_longer(x1, cols = 1:length(x1),names_to="city",values_to="count")
Mixing both pdftools and tesseract...
library(tesseract)
img_file <- pdftools::pdf_convert("https://www.atf.gov/file/89621/download", format = 'tiff', dpi = 400)
text <- ocr(img_file)
strsplit(text[10],"\n")
Expected output:
year  state  city         count
2019  AL     Birmingham   100
2018  CA     Los Angeles  200
2017  CA     None         30
2017  CA     Other        400
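One possible building block for getting there, sketched under a strong assumption: that the extracted text can be reduced to lines of the form "city name, then a count". The `lines` vector below is hypothetical and reuses values from the expected output; real pdf_text() or ocr() output will need cleaning first:

```r
# Hypothetical cleaned-up lines: a city name followed by a count
lines <- c("Birmingham 100", "Los Angeles 200")

# Everything up to the last run of whitespace is the city; the trailing digits are the count
pattern <- "^(.*\\S)\\s+([0-9,]+)$"
city  <- sub(pattern, "\\1", lines, perl = TRUE)
count <- as.numeric(gsub(",", "", sub(pattern, "\\2", lines, perl = TRUE)))

# Year and state would come from the link titles collected earlier
cities <- data.frame(city = city, count = count, stringsAsFactors = FALSE)
```

Anchoring the count at the end of the line keeps multi-word names such as "Los Angeles" intact.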

How can I time scraping news stories from a list of urls with R?

I am trying to download the text of newspaper articles for textual analysis using R. I have a large list of URLs to individual articles and want to use rvest to extract each article's text and title and convert them into a data frame.
As an example, I have a subset of my dataset with articles from The Guardian:
> items$link[1:8]
[1] "https://www.theguardian.com/uk-news/2019/nov/16/concerns-raised-cladding-bolton-student-building-fire"
[2] "https://www.theguardian.com/uk-news/2019/nov/16/top-lawyer-calls-prince-andrew-bbc-interview-catastrophic-error"
[3] "https://www.theguardian.com/politics/live/2019/nov/16/general-election-labour-meet-decide-manifesto-clause-v-live-news"
[4] "https://www.theguardian.com/politics/2019/nov/16/priti-patel-block-rescue-british-isis-children"
[5] "https://www.theguardian.com/politics/2019/nov/16/police-assessing-claims-that-tories-offered-peerages-to-brexit-party"
[6] "https://www.theguardian.com/world/2019/nov/16/paris-police-fire-teargas-on-anniversary-of-gilets-jaunes-protests"
[7] "https://www.theguardian.com/us-news/2019/nov/16/trump-personally-kept-pressure-ukraine-impeachment-inquiry-witness-david-holmes-diplomat"
[8] "https://www.theguardian.com/world/2019/nov/16/hong-kong-chinese-troops-deployed-to-help-clear-roadblocks"
My code so far is:
## SETUP ##
rm(list=ls())
library(tidyverse)
library(rvest)
library(stringr)
library(readtext)
library(quanteda)
library(beepr)
setwd("uk")
## Functions ##
parse_texts <- function(nod){
body <- str_squish(as.character(nod) %>% read_html() %>%
html_nodes('.js-article__body > p') %>% #collects all text in article
html_text())
one_body <- paste(body, collapse = " ") # puts all of the text together
data.frame(title = str_squish(nod %>% read_html() %>%
html_node('.content__headline') %>%
html_text()),
date_time = str_squish(nod %>% read_html() %>%
html_node('.content__dateline-wpd--modified') %>%
html_text()),
text = one_body,
stringsAsFactors = FALSE)
}
#extract file text
test_df <- lapply(items$link[1:5], parse_texts) %>% bind_rows()
This works, for the most part. My problem is that I have thousands of urls in my data. How can I automate a script that will slowly work through this list?
Thanks to Dave2e for answering the question.
I added Sys.sleep(2) to the parse_texts function and was able to go through my list of URLs.
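An alternative to putting the pause inside parse_texts is to wrap the loop itself, which also lets one failed URL be skipped instead of stopping the run. A minimal sketch: `scrape_slowly` and `fake_parse` are hypothetical names, and `fake_parse` stands in for parse_texts so the example runs without network access:

```r
# Hypothetical wrapper: call parse_fun on each URL, pausing between requests
scrape_slowly <- function(urls, parse_fun, delay = 2) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    value <- tryCatch(parse_fun(urls[i]), error = function(e) NULL)
    results[i] <- list(value)              # list() keeps the slot even when value is NULL
    if (i < length(urls)) Sys.sleep(delay) # be polite to the server
  }
  results
}

# Stand-in for parse_texts(): fails on one input to show the error handling
fake_parse <- function(u) {
  if (u == "bad") stop("could not parse")
  data.frame(link = u, stringsAsFactors = FALSE)
}

out <- scrape_slowly(c("a", "bad", "c"), fake_parse, delay = 0)

# Drop the failures and bind the rest into one data frame
articles <- do.call(rbind, Filter(Negate(is.null), out))
```

With the real function the call would look like scrape_slowly(items$link, parse_texts), keeping the default two-second delay.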

When scraping with rvest expected html_node not appearing

The ITTO website produces a table of timber products and flows directly under the search form once the query is submitted (on the same page). Using information I obtained from Chrome's SelectorGadget I'm expecting the table to appear as the css element "td". Using rvest to scrape information on Albania for 2014...
library(rvest)
session <- html_session("http://www.itto.int/annual_review_output/?mode=searchdata")
form <- html_form(session)[[2]]
form <- set_values(form, "countries[]" = "8", "products[]" = "1" ,"flows[]" = "1", "years[]" = "2014")
query <- submit_form(session, form, submit = NULL)
page <- read_html(query) %>% html_nodes("td")
page
Which results in the table "td" being absent:
{xml_nodeset (0)}
Examining other elements of the page with html_nodes() suggests that submit_form() performed otherwise as expected.
So my question is where is the expected table?
It might be easier (in the long run) to scrape the select box options and just feed the POST call directly:
library(httr)
library(rvest)
res <- POST(url = "http://www.itto.int/annual_review_output/?mode=searchdata",
body = list(`countries[]` = "76",
`products[]` = "1", `flows[]` = "1",
`years[]` = "2014"),
encode = "form")
pg <- content(res, as="parsed")
html_nodes(pg, "td")
## {xml_nodeset (7)}
## [1] <td>Brazil</td>
## [2] <td>Ind. roundwood</td>
## [3] <td>Exports Quantity</td>
## [4] <td>1000 m3</td>
## [5] <td>2014</td>
## [6] <td style="text-align:right;">204.59</td>
## [7] <td>I</td>
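To go from the cells to a data frame, the text can be reshaped into rows of seven values. A minimal offline sketch: the cells are copied from the output above, but the surrounding <table><tr> wrapper is an assumption, and the column names are left as the defaults because the real headers are not shown here:

```r
library(rvest)

# Cells copied from the output above; the <table><tr> wrapper is assumed
pg <- minimal_html('<table><tr>
  <td>Brazil</td><td>Ind. roundwood</td><td>Exports Quantity</td>
  <td>1000 m3</td><td>2014</td><td style="text-align:right;">204.59</td><td>I</td>
</tr></table>')

cells <- pg %>% html_elements("td") %>% html_text2()

# Seven cells per record, filled row by row
result <- as.data.frame(matrix(cells, ncol = 7, byrow = TRUE), stringsAsFactors = FALSE)
result$V6 <- as.numeric(result$V6)  # the quantity column
```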