Saving the Text from a News Article in R?

I found this post, which shows how to save the text from a website: "Is there a simple way in R to extract only the text elements of an HTML page?"
I tried one of the answers provided here and it seems to be working quite well! For example:
library(htm2txt)
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text_1 <- gettxt(url_1)
url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
text_2 <- gettxt(url_2)
All the text from the article appears, but so does a lot of "extra text" which does not have any meaning. For example:
p. 40/03B\n• ^ a or identifiers\n• Articles with GND identifiers\n• Articles with ICCU identifiers\n•
Is there some standard way to only keep the actual text from these articles? Or does this depend too much on the individual structure of the website and no "one size fits all" solution exists for such a problem?
Perhaps there might be some method of doing this in R that only recognizes the "actual text"?
Thank you!

You can cross-reference the words from the HTML page with a dictionary from qdapDictionaries, so only real English words are kept, but this method does keep words that aren't exclusively from the article (e.g., the word "jump" from "Jump to navigation").
library(tidyverse)
library(htm2txt)
library(quanteda)
library(qdapDictionaries)
data(DICTIONARY)
text <- 'https://en.wikipedia.org/wiki/Alan_Turing' %>% gettxt() %>% corpus()
text <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
text <- tokens_select(text, DICTIONARY$word)
text <- data.frame(text = sapply(text, as.character), stringsAsFactors = FALSE) %>%
group_by(text1 = tolower(text)) %>%
table() %>% as.data.frame() %>%
rename(word = text1) %>%
rename(frequency = Freq)
head(text)
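
A different approach from the dictionary filter above, offered only as a minimal sketch: when the article body sits inside a known container, selecting just its paragraph elements with rvest tends to drop navigation and footer text. The inline HTML and the selector "#content p" are assumptions made for illustration; a real site needs its own selector, so this is not a "one size fits all" answer. It assumes rvest 1.0 or later (for html_elements() and html_text2()).

```r
library(rvest)

# Hypothetical page standing in for a news article (not a real site).
page <- read_html('
  <div id="nav"><a href="#">Jump to navigation</a></div>
  <div id="content">
    <h1>Headline</h1>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <div id="footer"><ul><li>Articles with GND identifiers</li></ul></div>
')

# Keep only the paragraph text inside the assumed content container.
paragraphs <- page %>%
  html_elements("#content p") %>%
  html_text2()

# Join the paragraphs into one string, separated by blank lines.
article_text <- paste(paragraphs, collapse = "\n\n")
cat(article_text)
```

Browser developer tools or SelectorGadget can help find the container selector for a given site.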

Related

Side-by-Side gt tables **WITH** footnotes

I am trying to create side-by-side gt tables, as the title suggests. I started with the very helpful answer found here: Arrange gt tables side by side or in a grid or table of tables. The key was to output the left and right tables as raw HTML (as_raw_html), then combine them in a data frame, then send that back into gt and reformat it as markdown (fmt_markdown).
However, I ran into a problem that I couldn't solve. The fmt_markdown command skips the footnote, so the resulting table has the raw html as a footnote.
I checked the documentation for gt, and the fmt_markdown command takes columns and rows as input - but, apparently, the footnote area is considered neither a column nor a row.
So the crux is that I can't find any way to target the footnote area for reformatting as markdown.
Below is a reproducible example.
library(tidyverse)
library(gt)
# Make a table with a footnote
tL <- exibble %>%
select(c(num, char, group)) %>%
gt() %>%
tab_footnote(
footnote = html("**I'm an apricot**"),
locations = cells_body(columns = char,
rows = char == "apricot")
) %>%
tab_style(style = cell_text(color = "blue"),
locations = cells_footnotes()) %>%
as_raw_html()
# Make a copy
tR <- tL
# Side-By-Side
SideBySide <- data.frame(Ltable = tL, Rtable = tR) %>%
gt() %>%
fmt_markdown(columns = everything())
And the result looks like this:
Created on 2022-02-20 by the reprex package (v2.0.1)

Web Scrape an Image with rvest R

I'm having a problem when trying to scrape an image from this page. My code is as follows:
library(rvest)
url <- read_html("https://covid-19vis.cmm.uchile.cl/chart")
m <- '/html/body/div/div/div[4]/main/div/div/div/div/div/div[2]/div[1]'
grafico_cmm <- html_node(url, xpath = m) %>% html_attr('src')
When I run the above code, the result is NA. Does someone know how I can scrape the plot, or maybe the data, from the page?
Thanks a lot
It is not an image; it is an interactive chart. To get an image, you would need to scrape the data points, re-create the chart, and then convert it to an image. The XPath is also invalid.
The data comes from an API call. I checked the values against the chart and this is the correct endpoint.
library(jsonlite)
data <- jsonlite::read_json('https://covid-19vis.cmm.uchile.cl/api/data?scope=0&indicatorId=57', simplifyVector = TRUE)
The chart needs some tidying but here is a basic plot of the r values:
# Convert the date column without the pipe, so no extra package is needed
data$date <- as.Date(data$date)
library("ggplot2")
ggplot(data=data,
aes(x=date, y=value, colour ='red')) +
geom_line() +
scale_color_discrete(name = "R Efectivo", labels = c("Chile"))
print(tail(data))

RVEST - Extracting text from table - Problems with access to the right table

I would like to extract the values in the table on the top right side of this Webpage:
https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima
(Wärmster Monat : VALUE, Kältester Monat: VALUE, Jahresniederschlag: VALUE)
Unfortunately, if I use html_nodes("SelectorGadget's result for the specific value"), I receive the values for the table at the top of this link instead:
https://www.timeanddate.de/stadt/info/deutschland/karlsruhe
(The webpages are similar: if you click "Uhrzeit/Übersicht" in the top bar, you reach the second page and table; if you click "Wetter" --> "Klima", you reach the first page and table, which is the one I want to extract values from!)
num_link= "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"
num_page= read_html(num_link)
rain_year = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()
temp_warm = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
temp_cold = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
I get "character (empty)" for each variable. :(
THANK YOU IN ADVANCE!
You can use the html_table function in rvest, which is pretty good by now. It makes extraction a bit easier, but I do recommend learning to identify the right CSS selectors as well, because html_table does not always work. When called on a whole page, html_table returns a list with all tables from the webpage, so in this case the steps are:
get the html
get the tables
index the right table (here there is only one)
reformat a little to extract the values
library(rvest)
library(tidyverse)
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
html_table() %>%
.[[1]] %>%
rename('measurement' = 1,
'original' = 2) %>%
mutate(value_num = str_extract_all(original,"[[:digit:]]+\\.*[[:digit:]]*") %>% unlist())

Why does rvest/html_table skip every second row in this table?

I am trying to scrape this table but for some reason every second row is skipped (which means I don't have data on half of the states). This is my code:
# read in the url
library(dplyr)
library(rvest)
webpage <- read_html ("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
html_node("table") %>%
html_table(fill = TRUE)
Does anyone have any ideas why this is? The only thing I can think of is that every second row has background colour specified?
Thanks :)
See if this helps:
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
html_nodes(xpath = '/html/body/table') %>%
html_table(fill = TRUE)
d = as.data.frame(df)

Scraping html table and its href Links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The above gives a nice data frame. However, the last column only contains the text "Pass" when I would like to have the link associated with it. I have tried to use the following to add the links, but they do not correspond to the correct row:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about html, css, etc, so I am not sure what I am missing to accomplish this properly.
Thanks!!
You're looking for a (anchor) elements inside table cells (td), and then you want the value of each one's href attribute. So here's one way, which will return a vector with all the URLs for the PDF downloads:
qi_webpage %>%
html_nodes(xpath = "//td/a") %>%
html_attr("href")
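
To keep each link on the same row as its table data, one option is to walk the table rows and take at most one link per row. The following is a minimal sketch on made-up HTML (the table contents and the example.com URLs are hypothetical, not taken from the page in the question). It relies on rvest's html_element(), which returns one result per input node and a missing value where nothing matches, and it assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical table: the second data row has no link.
page <- read_html('
  <table>
    <tr><th>Product</th><th>Result</th></tr>
    <tr><td>A</td><td><a href="https://example.com/a.pdf">Pass</a></td></tr>
    <tr><td>B</td><td>Pending</td></tr>
    <tr><td>C</td><td><a href="https://example.com/c.pdf">Pass</a></td></tr>
  </table>
')

# Parse the table text as usual.
tbl <- page %>% html_element("table") %>% html_table()

# Select only rows that contain <td> cells, which skips the header row.
rows <- page %>% html_elements(xpath = "//tr[td]")

# One href per row; rows without a link yield NA, so alignment is preserved.
tbl$link <- rows %>% html_element("td a") %>% html_attr("href")

print(tbl)
```

Because the link vector has exactly one entry per data row, it can be attached as a column without any ifelse() matching on the cell text.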