Read HTML into R - html

I would like R to take a word in a column in a dataset, and return a value from a website. The code I have so far is below. So, for each word in the data frame column, it will go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the website, and it's unclear what I would need to enter to return this value - it's between <tt> and </tt> but there are many of these. I'm also not sure how to then get that value into R. Thank you.
library(xml2)
for (word in df$word) {
result <- read_html("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in="word"&stress=-s")
}

Parsing HTML is a tricky task in R. There are a couple ways though. If the HTML converts well to XML and the website/API always returns the same structure then you can use tools to parse XML. Otherwise you could use regex and call stringr::str_extract() on the HTML.
For your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt> tags but the one you want is always in the second instance so you can just pull out that one.
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
for (word in wordlist){
#build the url and GET the result
url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
h <- handle(url)
res <- GET(handle = h)
#parse the HTML
resXML <- htmlParse(content(res, as = "text"))
#retrieve second <tt>
print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
#don't abuse your API
Sys.sleep(0.1)
}
>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
Good luck!
EDIT: This code will return a dataframe:
#load packages. dplyr is just to use the pipe %>% function
library(httr)
library(XML)
library(dplyr)
#test words
wordlist = c('happy', 'sad')
#initializae the dataframe with pronunciation field
pronunciation_list <- data.frame(pronunciation=character(),stringsAsFactors = F)
#loop over the words
for (word in wordlist){
#build the url and GET the result
url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=",word,"&stress=-s")
h <- handle(url)
res <- GET(handle = h)
#parse the HTML
resXML <- htmlParse(content(res, as = "text"))
#retrieve second <tt>
to_add <- data.frame(pronunciation=(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))
#bind the data
pronunciation_list<- rbind(pronunciation_list, to_add)
#don't abuse your API
Sys.sleep(0.1)
}

Related

R - Issue with the DOM of the danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about their democratic process and they are uploading all the legislative documents on their website. I've been crawling over all pages starting 2008. Right now I'm parsing the information into a dataframe and I'm having an issue that I was not able to resolve so far.
If we look at the DOM we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16-19. To parse the information correctly for my dataframe I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my pattern match more than once and I don't know how to tell R that I only want the first match.
for the sake of an example I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" are related to the text. I was able to get the text of the webpage with the following code. Also, the following code identifies the position of the first "regex hit" of the different patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}
Here is another approach :
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}

R highcharter get data from plots saved as html

I plot data with highcharter package in R, and save them as html to keep interactive features. In most cases I plot more than one graph, therefore bring them together as a canvas.
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
I want to extract the data from plots that I have saved as html in the past, just like these. Normally I would do hc_list[[1]]$x$hc_opts$series, but when I import html into R and try to do the same, I get an error. It won't do the job.
> hc_imported <- htmltools::includeHTML("test_grid.html")
> hc_imported[[1]]$x$hc_opts$series
Error in hc_imported$x : $ operator is invalid for atomic vectors
If I would be able to write a function like
get_my_data(my_imported_highcharter,3) # get data from 3rd plot
it would be the best. Regards.
You can use below code
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
# hc_imported <- htmltools::includeHTML("test_grid.html")
# hc_imported[[1]]$x$hc_opts$series
library(jsonlite)
library(RCurl)
library(XML)
get_my_data<-function(my_imported_highcharter,n){
webpage <- readLines(my_imported_highcharter)
pagetree <- htmlTreeParse(webpage, error=function(...){})
body <- pagetree$children$html$children$body
divbodyContent <- body$children$div$children[[n]]
script<-divbodyContent$children[[2]]
data<-as.character(script$children[[1]])[6]
data<-fromJSON(data,simplifyVector = FALSE)
data<-data$x$hc_opts$series[[1]]$data
return(data)
}
get_my_data("test_grid.html",3)
get_my_data("test_grid.html",1)

rvest cannot find node with xpath

This is the website I scapre
ppp projects
I want to use xpath to select the node like below
The xpath I get by use inspect element is "//*[#id="pppListUl"]/li1/div2/span2/span"
My scrpits are like below:
a <- html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[#id="pppListUl"]/li[1]/div[2]/span[2]/span')
b
Then I got the result
{xml_nodeset (0)}
Then I check the page source, I didn't even find anything about the project I selected.
I was wondering why I cannot find it in the page source, and in turn, how can I get the node by rvest.
It makes an XHR request for the content. Just work with that data (it's pretty clean):
library(httr)
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
encode="form",
body=list(queryPage=1,
distStr="",
induStr="",
investStr="",
projName="",
sortby="",
orderby="",
stageArr="")) -> res
content(res, as="text") %>%
jsonlite::fromJSON(flatten=TRUE) %>%
dplyr::glimpse()
(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).
It's a 4 element list with fields totalCount, list (which has the actual data), currentPage and totalPage.
It looks like you can change the queryPage form variable to iterate through the pages to get the whole list/database, something like:
library(httr)
library(purrr)
library(dplyr)
get_page <- function(page_num=1, .pb=NULL) {
if (!is.null(.pb)) pb$tick()$print()
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
encode="form",
body=list(queryPage=page_num,
distStr="",
induStr="",
investStr="",
projName="",
sortby="",
orderby="",
stageArr="")) -> res
content(res, as="text") %>%
jsonlite::fromJSON(flatten=TRUE) -> dat
dat$list
}
n <- 5 # change this to the value in `totalPage`
pb <- progress_estimated(n)
df <- map_df(1:n, get_page, pb)

How to read a <li> table in a webpage

I debug the program many times to get the result as follows:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no metter what parameters I use, the result is not correct. The content in this table is one part of the basis of my further analysis, and I am very trembled for it. I'm looking forward to your help with great sincerity.
## download community-list ---the 1st level of IR Grid
#loading webpage and analyzing
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb<-readHTMLTable(tableNodes[[8]], header=TRUE)
# get External links
xpath <- "//a/#href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)
it is unclear exactly what you want your end result to look like but if you modify your xpath statements a bit to take advantage of the DOM structure you can get something like this:
library(XML)
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[#class="communityLink"]//#href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[#class="communityLink"]//a', xmlValue)
table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data
console output causes stackoverflow to think this answer is spam but here is a screen shot:

Issues with readHTMLTable in R

I was trying to use readHTMLTable to store some data in a dataframe in R Studio, but it just keeps telling me could not find function "ReadHTMLTable". I don't understand where I did wrong. Can someone take a lot at this and tell me how I can fix this? or if it works in your R studio.
url <- 'http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/case-counts.html'
ebola <- getURL(url)
ebola <- readHTMLTable(ebola, stringAsFactors = F)
Error: could not find function "readHTMLTable"
You are reading the table in with R default which converts characters to factors. You can use stringsAsFactors = FALSE in readHTMLTable and this will be passed to data.frame. Also the table uses commas for thousand seperators which you will need to remove :
library(XML)
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1, which = 2, stringsAsFactors = FALSE)
df1$"Human death"
mySum <- sum(as.integer(gsub(",", "", df1$"Human death")))
> mySum
[1] 6910
The problem is that you dont initialize de XML library
library(XML)