R web scraping: I can't pull up the elements I want - html

I'm a beginner in web scraping using R. I'm trying to scrape the following webpage: https://bkmea.com/bkmea-members/#/company/2523.
I would like to get all text elements under div nodes with class="company_name", as well as text elements under td nodes. For example, I'm trying to fetch the company name ("MOMO APPARELS LTD") as in the following HTML snippet.
<div class="comapny_header">
<div class="company_name">MOMO APPARELS LTD</div>
<div class="view_all">View All</div>
</div>
So I've written the following code:
library(textreadr)
library(rvest)
companyinfo <- read_html("https://bkmea.com/bkmea-members/#/company/2523")
html_nodes(companyinfo,"div")%>%
html_text() # it works
html_nodes(companyinfo,"div.company_name")%>%
html_text() # doesn't work
html_nodes(companyinfo,"td") %>%
html_text() # doesn't work
If I understand correctly, the first call should pull the text of all div nodes, the second the text of div nodes whose class attribute equals company_name, and the third the text of td nodes.
The first one works (which isn't what I'm trying to get), but the second and third don't. Am I doing something terribly wrong?
I'd really appreciate it if you could help me out here!!
Many thanks,
Sang

The data you're looking for is retrieved by this API (it is not present in the HTML body):
GET https://bkmea.com/wp-admin/admin-ajax.php?action=bkmea_get_company&id=2523
You just need to extract the id from your original URL, build the URL above, and parse the JSON result as follows:
library(httr)

originalUrl <- "https://bkmea.com/bkmea-members/#/company/2523"
# Take everything after the last "/" as the company id
id <- sub("^.+/", "", originalUrl)
userAgent <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"

output <- content(
  GET("https://bkmea.com/wp-admin/admin-ajax.php",
      query = list("action" = "bkmea_get_company", "id" = id),
      add_headers("User-Agent" = userAgent)),
  as = "parsed", type = "application/json")

print(output$company$company_info$company_name)
Output:
[1] "MOMO APPARELS LTD"

Related

Scraping an HTML Table which is returning a list of 0

I am trying to scrape a table from the OECD website about FDI flows between 2005 and 2021. But when I run the code using html_table(), it returns a list of 0.
I tried the same code with a different table and it worked fine, but this one is not working.
library(rvest)
library(dplyr)
link = "https://data.oecd.org/fdi/fdi-flows.htm#indicator-table"
page = read_html(link)
table = page %>% html_nodes("table.DataTable") %>% html_table()
The code above returns 'table: List of 0', so extracting the first element with %>% .[[1]] fails with the following error:
Error in .[[1]] : subscript out of bounds
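A quick way to check whether the table exists in the static HTML at all (a sketch; a likely cause is that the OECD page builds the table with JavaScript after load, in which case rvest never sees it and you would need to find the underlying API, as in the first question above):
library(rvest)

page <- read_html("https://data.oecd.org/fdi/fdi-flows.htm")
# Count every <table> node in the raw HTML; 0 means the table is
# injected client-side and html_table() has nothing to parse
length(html_nodes(page, "table"))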

How to extract a table, convert it to a data frame, write it as a CSV file, and deal with child tables?

I can't write the file as a CSV file; there is an error. I want to extract bike-sharing system data from a Wiki page and convert it to a data frame, but when I use head() or str() to look at the result, I can't make out the table I need because the output comes with so many unorganized details.
Also note that this HTML page contains at least three table nodes under the root HTML node, so you will need the html_nodes(root_node, "table") function to get all of them:
<html>
<table>(table1)</table>
<table>(table2)</table>
<table>(table3)</table>
...
</html>
url<- "https://en.wikipedia.org/wiki/List_of_bicycle-sharing_systems"
root_node<-read_html(url)
table_nodes <- html_nodes(root_node,"table")
Bicycle_sharing <- html_table(table_nodes, fill = TRUE )
head(Bicycle_sharing)
summary(Bicycle_sharing)
str(Bicycle_sharing)
## Exporting the date frame as csv. file.
write.csv(mtcars,"raw_bike_sharing_systems.csv",row.names = FALSE)
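A more concise approach is to read the page once, let html_table() collect every table, keep the one you need, and write that out. Here the bike-sharing table is assumed to be the second on the page, and janitor::clean_names() tidies the headers: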
library(tidyverse)
library(rvest)

data <- "https://en.wikipedia.org/wiki/List_of_bicycle-sharing_systems" %>%
  read_html() %>%
  html_table() %>%
  getElement(2) %>%
  janitor::clean_names()

data %>%
  write_csv(file = "bike_sharing.csv")

R loop with html_nodes (rvest) isn't catching all data

I would like to make a loop with html_nodes to capture the values of some nodes (the nodes themselves, not their text). That is, I have some values:
library(rvest)
country <- c("Canada", "US", "Japan", "China")
With those values ("Canada", "US", ...), I've written a loop that builds a URL by pasting each value onto "https://en.wikipedia.org/wiki/", reads each page with read_html(), and finally captures a node with html_nodes('a.page-link') (yes, a node, not text) and saves it via as.character() in a data.frame (or it could be a list).
dff <- NULL
for (i in country) {
  url <- paste0("https://en.wikipedia.org/wiki/", i)
  page <- read_html(url)
  b <- page %>%
    html_nodes('h2.flow-title') %>%
    html_nodes('a.page-link') %>%
    as.character()
  dff <- data.frame(b)
}
The problem is that this code only saves the data from the last country. It processes the first country and stores its nodes, but when it processes the next country the earlier data is erased and replaced by the new data, and so on, so the final result is just the data from the last country.
I would be grateful with your help!
As the comment mentioned, the line dff <- data.frame(b) overwrites dff on each loop iteration. One solution is to create an empty list and append the data to it.
In this example the list items are named for the country queried.
library(rvest)

country <- c("Canada", "US", "Japan", "China")

# Initialize the empty list
dff <- list()

for (i in country) {
  url <- paste0("https://en.wikipedia.org/wiki/", i)
  page <- read_html(url)
  b <- page %>%
    html_nodes('h2.flow-title') %>%
    html_nodes('a.page-link') %>%
    as.character()
  # Append new data onto the list, named for the country
  dff[[i]] <- data.frame(b)
}
To access the data, one can use dff$Canada, or lapply to process the entire list.
Note: I ran your example and it returned no results; better double-check the node selectors.
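If you eventually want a single data frame instead of a list, a minimal sketch (assuming every element of dff has the same columns):
# Stack the per-country data frames (list names end up in the row names)
combined <- do.call(rbind, dff)

# Or with dplyr, which adds the country name as an explicit column:
# combined <- dplyr::bind_rows(dff, .id = "country")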

Extracting href attr or converting node to character list

I'm trying to extract some information from this website:
library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- read_html(url)  # read_html() replaces the old html() reader
nodes <- html_nodes(html, ".listItemSolr")
nodes
I get "list" of 30 parts of HTML code. I want from each element of the "list" extract last href attribute, so for the 30. element it would be
<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">
so I want to get string
"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"
The problem is that html_attr(nodes, "href") doesn't work (I get a vector of NAs). So I thought about regex, but the problem is that nodes isn't a character list.
class(nodes)
[1] "XMLNodeSet"
I tried
xmlToList(nodes)
but it doesn't work either.
So my question is: how can I extract this URL with some function made for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a character list?
Try searching inside nodes' children:
nodes <- html_nodes(html, ".listItemSolr")
sapply(html_children(nodes), function(x){
  html_attr(x$a, "href")
})
Update
Hadley suggested using elegant pipes:
html %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a") %>%
  html_attr("href")
The XML package's getHTMLLinks() function can do virtually all the work for us; we just have to write the XPath query. Here we query all node attributes to determine whether any contains "listItemSolr", then select the parent node for the href query.
getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")
In xpQuery we are doing the following:
//@*[contains(., 'listItemSolr')] queries all node attributes for listItemSolr
/.. selects the parent node
/a/@href gets the href links
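For reference, a sketch of a current-rvest equivalent that keeps only the last link per block (assuming the page still uses the listItemSolr class with direct <a> children):
library(rvest)

read_html(url) %>%
  html_nodes(xpath = "//*[contains(@class, 'listItemSolr')]/a[last()]") %>%
  html_attr("href")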

Using \Sexpr{} in LaTeX tabular environment

I am trying to use \Sexpr{} to include values from my R objects in a LaTeX table. Essentially, I want to replicate the summary output of an lm object in R, because xtable's built-in methods xtable.lm and xtable.summary.lm don't seem to include the F-statistic, adjusted R-squared, etc. (all the material at the bottom of the summary printout in the R console). So I built a matrix to replicate the xtable.summary.lm output, then constructed a data frame of the relevant extra information so I could refer to its values using \Sexpr{}. I did this with add.to.row, appending a \multicolumn{} command that merges all columns of the last row of the LaTeX table, and passing all the information I need into that cell.
The problem is that I get an "Undefined control sequence" error for the \Sexpr{} expression inside the \multicolumn{} expression. Are the two not compatible? If so, what am I doing wrong, and if not, does anyone know how to do what I'm trying to do?
Thanks,
Here is the relevant part of my code:
<<Test, results=tex>>=
library(xtable)  # needed for xtable() below

model1 <- lm(stndfnl ~ atndrte + frosh + soph)

# Build a matrix to replicate the xtable.summary.lm output
x <- summary(model1)
colnames <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames <- c("(Intercept)", attr(x$terms, "term.labels"))
fpval <- pf(x$fstatistic[1], x$fstatistic[2], x$fstatistic[3], lower.tail = FALSE)
mat1 <- matrix(coef(x), nrow = length(rownames), ncol = length(colnames),
               dimnames = list(rownames, colnames))

# Make a data frame of extra information to be called by \Sexpr in the last row
residse <- x$sigma
degf <- x$df[2]
multr2 <- x$r.squared
adjr2 <- x$adj.r.squared
fstat <- x$fstatistic[1]
fstatdf1 <- x$fstatistic[2]
fstatdf2 <- x$fstatistic[3]
extradat <- data.frame(v1 = round(residse, 4), v2 = degf, v3 = round(multr2, 4),
                       v4 = round(adjr2, 4), v5 = round(fstat, 3),
                       v6 = fstatdf1, v7 = fstatdf2, v8 = round(fpval, 6))

addtorow <- list()
addtorow$pos <- list()
addtorow$pos[[1]] <- dim(mat1)[1]
addtorow$command <- c('\\hline \\multicolumn{5}{l}{Residual standard error:\\Sexpr{extradat$v1}} \\\\ ')

print(xtable(mat1, caption = "Summary Results for Regression in Equation \\eqref{model1}",
             label = "tab:model1"),
      add.to.row = addtorow, sanitize.text.function = NULL, caption.placement = "top")
You don't need to have Sexpr in your R code; the R code can use the expressions directly. Sexpr is not a LaTeX command, even though it looks like one; it's an Sweave command, so it doesn't work to have it as output from R code.
Try
addtorow$command <- paste('\\hline \\multicolumn{5}{l}{Residual standard error:',
                          extradat$v1, '} \\\\ ')
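Because paste() is evaluated when the chunk runs, the value of extradat$v1 is embedded directly into the LaTeX string that xtable emits, so no \Sexpr{} is needed inside the table body.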
Also, there's no need to completely recreate the matrix used by xtable; you can just build on the default output. Building on what you have above, something like:
mytab <- xtable(model1, caption = "Summary Results", label = "tab:model1")
addtorow$pos[[1]] <- dim(mytab)[1]
print(mytab, add.to.row = addtorow, sanitize.text.function = NULL,
      caption.placement = "top")
See http://people.su.se/~lundh/reproduce/sweaveintro.pdf for an example which you might be able to use as is.