Issue loading HTML Table into R - html

I want to load the table at the bottom of the following webpage into R, either as a dataframe or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function in the XML package
library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table
However, this returns an empty list and gives me the following warning:
Warning message:XML content does not seem to be XML: ''
I also tried adapting code I found here: Scraping html tables into R data frames using the XML package. This worked for 5 of the 6 tables on the page, but for the 6th table, which is the one I am interested in, it returned only the header row plus a single row repeating the header values. Code below:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those in the header row
head(applicanttable)
Any insights would be greatly appreciated! For reference, I have also consulted the following posts that appear to have similar goals, but could not find a solution there:
Scraping html tables into R data frames using the XML package
Extracting html table from a website in R

The data is dynamically pulled from a nested JavaScript array inside a script tag when JavaScript runs in the browser. This doesn't happen when you use rvest, which retrieves the non-rendered content (as seen in view-source).
You can regex out the appropriate nested array and then re-construct the table by splitting out the rows, adding the appropriate headers and performing some data manipulations on various columns e.g. some columns contain html which needs to be parsed to obtain the desired value.
As some columns, e.g. Name, contain values which could be interpreted as file paths when using read_html, I use htmltidy to ensure they are handled as valid html.
N.B. If you use RSelenium then the page will render and you can just grab the table directly without reconstructing it; a minimal sketch follows.
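A rough sketch of that RSelenium route, assuming a working browser driver is available locally (the port/browser choice and the Sys.sleep are illustrative; the #applicants-table id is the one used for the headers below):
library(RSelenium)
library(rvest)

# Start a browser session (assumes a driver such as geckodriver is installed)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.lawschooldata.org/school/Yale%20University/18")
Sys.sleep(5)  # give the JavaScript time to populate the applicants table

applicants <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_node("#applicants-table") %>%
  html_table()

remDr$close()
rD$server$stop()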
TODO:
There are still some data type manipulations you could choose to apply to a few columns.
There is some more logic to be applied to ensure only the Name is returned in the Name column. Take the case of df$Name[10]: this returns "Character and fitness issues" instead of Anxiousboy, because the required value actually sits in element.nextSibling.nextSibling of the p tag which is selected. These infrequent edge cases need some additional logic built in; in this case, you might test for a particular string being returned and then resort to re-parsing with an xpath expression, roughly as sketched below.
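For example, a hypothetical helper (not part of the answer's code; the xpath is an assumption about where the stray Name value sits) could look like:
# Hypothetical fallback for the Name edge case: if the parsed value is the known
# "wrong" string, re-parse and take the first non-empty text sibling after the <p>
fix_name <- function(input, parsed_value) {
  if (identical(parsed_value, "Character and fitness issues")) {
    tidy_html(input) %>%
      read_html() %>%
      html_node(xpath = "//p/following-sibling::text()[normalize-space()][1]") %>%
      html_text(trim = TRUE)
  } else {
    parsed_value
  }
}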
R:
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
get_value <- function(input) {
  value <- tidy_html(input) %>%
    read_html() %>%
    html_node("a, p, span") %>%
    html_text(trim = T)
  result <- ifelse(is.na(value), input, value)
  return(result)
}

tidy_result <- function(result) {
  return(gsub("<.*", "", result))
}
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")
s <- page %>% toString()
headers <- page %>%
  html_nodes("#applicants-table th") %>%
  html_text(trim = T)

s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = T)) %>%
  gsub("\n", "", .)
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = T))[[1]] %>% as.list()
df <- sapply(rows, function(x) {
  stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
  t() %>%
  as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
names(df) <- headers
df <- df %>%
  rowwise() %>%
  mutate(across(c("Name", "GRE", "URM", "$$$$"), .f = get_value)) %>%
  mutate_at(c("Result"), tidy_result)
write.csv(df, "Yale Applications.csv")
Created on 2021-06-23 by the reprex package (v0.3.0)
Sample output:

Related

How can I filter out numbers from an html table in R?

I am currently working on a forecasting model and to do this I would like to import data from an HTML website into R and save the values-part of the data set into a new list.
I have used the following approach in R:
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document,"//*/comment()"))
doc.tables<-readHTMLTable(document)
# show BID/ASK block:
doc.tables[2]
Which (doc.tables[2]) gives me in this case the result:
$`NULL`
Bid 0,765
1 Ask 0,80
How can i filter out the numbers (0,765 & 0,80) of the table, to save it into a list?
The issue is that 0,765 is actually the name of your data.frame column.
Your data frame being doc.tables[[2]]
You can grab the name by calling names(doc.tables[[2]])[2]
store that as a variable like name <- names(doc.tables[[2]])[2]
then you can grab the 0,80 by using doc.tables[[2]][[2]], store that as a variable if you like.
Final code should look like... my_list <- list(name, doc.tables[[2]][[2]])
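Putting those pieces together, and converting the German-style comma decimals to numeric, a minimal sketch (assuming the page still returns the same small Bid/Ask table) could be:
library(httr)
library(XML)

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
doc.tables <- readHTMLTable(document)

# Bid ends up as the second column *name*, Ask as the cell below it
bid <- names(doc.tables[[2]])[2]
ask <- doc.tables[[2]][[2]]

my_list <- list(bid = as.numeric(gsub(",", ".", bid)),
                ask = as.numeric(gsub(",", ".", ask)))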
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"),
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)
# Extract column from BID/ASK table
# Extract column from BID/ASK table
BidAsk = doc.tables[[2]][,2]
# Replace commas with point decimal separator
BidAsk = gsub(",", ".", BidAsk)
# Convert to numeric
BidAsk = as.numeric(BidAsk)
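If it helps to keep track of which value is which, the labels from the first column can be attached as names (a small optional addition, not part of the original answer):
# With header=FALSE the Bid/Ask labels stay in the first column of the table
names(BidAsk) <- as.character(doc.tables[[2]][, 1])
BidAsk  # e.g. Bid = 0.765, Ask = 0.80 at the time the question was asked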

Scrape object from html with rvest

I am new to web scraping with R and I am trying to get a daily-updated object which is probably not text. The URL is
https://covid19.public.lu/en.html and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with html and css, so if you have any useful sources or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
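If numeric values are more convenient than one comma-separated string, a small variation on the same selectors (the cleanup regex is an assumption about the headline figures, which use spaces as thousands separators and commas as decimals) is:
numbers <- html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text(trim = TRUE) %>%
  gsub("[^0-9,]", "", .) %>%   # keep digits and the comma decimal separator
  gsub(",", ".", .) %>%        # comma -> dot
  as.numeric()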
We can convert the text obtained from the Daily situation update block using the vroom package.
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
  read_html() %>%
  html_nodes('.cmp-gridStat__item-container') %>%
  html_text2()

vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is,
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns all values associated with the nodes, whereas html_node only returns the first node. Thus, the former fetches you all the nodes, which is really helpful.
html_text vs html_text2
html_text2 retains the breaks in strings, usually \n and \b. These are helpful when working with strings.
More info is in rvest documentation,
https://cran.r-project.org/web/packages/rvest/rvest.pdf
There is probably a much more elegant way to do this efficiently, but when I need brute force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and lookahead regex to get the exact piece of data I need. It basically takes the form of "(?<=text_right_before).+?(?=text_right_after)".
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")
normal_care = str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care = str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
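These come back as character strings; a small follow-up to work with them as numbers (the comma-to-dot step only matters for figures elsewhere on the page that use "," as the decimal separator):
normal_care <- as.numeric(gsub(",", ".", normal_care))
intensive_care <- as.numeric(gsub(",", ".", intensive_care))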
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract.
If you want it as a DataFrame (resembling the webpage) you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. It is a bit more effort, but as you already have other answers covering rvest I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication I use; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
  data <- pdf_text(filename)
  text <- data[[1]] %>% strsplit(split = "\n{2,}")
  web_data <- text[[1]][3:12]
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)
  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }
  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
                      Metric = clean_col2(Metric))
  return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull the items and then process them, you could extract each column as an item in a list. Replace br elements so that the content within them ends up in a comma separated list:
library(rvest)
library(magrittr)
library(purrr)   # for map()
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
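If it helps downstream, the same result can be stored and each block named by its first entry (a convenience, not part of the original answer):
stats <- map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty()) %>%
  purrr::set_names(map_chr(., 1))  # label each block by its first <p> entry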

web-scraping: web-scraped object doesn't match information on the website and crashes RStudio [closed]

I collected a series of URLs similar to this one. For each URL, I am using the rvest package to web-scrape information related to the address of every practitioner listed in each box of the webpage. By inspecting the HTML structure of the webpage, I noticed that the information I am trying to retrieve is present inside the HTML division called unit size1of2 (which appears, by hovering with the cursor, as div.unit.size1of2). Then, I used the following code to extract the information I need:
library(rvest)
library(xml2)
webpage <- read_html(x = "myURL")
webpage_name <- webpage %>%
html_nodes("div.unit.size1of2") %>%
html_text(trim = T)
However, when I extract the information, the result I get is super messy. First of all, there is information I didn't want to scrape, some of which doesn't even seem to be present on the website. In addition, my RStudio IDE freezes for a while every time I try to print the result, and afterwards it doesn't work properly with any command. Finally, the result is not the one I was looking for.
Do you think this is due to some kind of protection present on the website?
Thank you for your help!
You can start by iterating on rows, which can be selected using div.search-result .line, and then:
getting the name using div:first-child h3
getting the ordinal using div:first-child p
getting the location by iterating on div:nth-child(2) p since there can be multiple locations (one has 5 locations on your page) and store them in a list
It's necessary to remove the tabs and new lines using gsub("[\t\n]", "", x) for the name and ordinal. For the addresses, you can get the text, split according to the new line \n, remove duplicate new lines and strip the first and last line to get a list like:
[1] "CABINET VÉTÉRINAIRE DV FEYS JEAN-MARC"
[2] "Cabinet Veterinaire"
[3] "ZA de Kercadiou"
[4] "XXXXX"
[5] "LANVOLLON"
[6] "Tél : 0X.XX.XX.XX.XX"
The following code also converts the list of vectors to a dataframe with all the data on that page :
library(rvest)
library(plyr)
url = "https://www.veterinaire.fr/annuaires/trouver-un-veterinaire-pour-soigner-mon-animal.html?tx_siteveterinaire_general%5B__referrer%5D%5B%40extension%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40vendor%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40controller%5D=FrontendUser&tx_siteveterinaire_general%5B__referrer%5D%5B%40action%5D=search&tx_siteveterinaire_general%5B__referrer%5D%5Barguments%5D=YToxOntzOjY6InNlYXJjaCI7YTo1OntzOjM6Im5vbSI7czowOiIiO3M6NjoicmVnaW9uIjtzOjA6IiI7czoxMToiZGVwYXJ0ZW1lbnQiO3M6MDoiIjtzOjU6InZpbGxlIjtzOjA6IiI7czoxMjoiaXRlbXNQZXJQYWdlIjtzOjI6IjEwIjt9fQ%3D%3D21a1899f9a133814dfc1eb4e01b3b47913bd9925&tx_siteveterinaire_general%5B__referrer%5D%5B%40request%5D=a%3A4%3A%7Bs%3A10%3A%22%40extension%22%3Bs%3A15%3A%22SiteVeterinaire%22%3Bs%3A11%3A%22%40controller%22%3Bs%3A12%3A%22FrontendUser%22%3Bs%3A7%3A%22%40action%22%3Bs%3A6%3A%22search%22%3Bs%3A7%3A%22%40vendor%22%3Bs%3A15%3A%22SiteVeterinaire%22%3B%7D7cd75ca141359a98763248c24da8103293a53d08&tx_siteveterinaire_general%5B__trustedProperties%5D=a%3A1%3A%7Bs%3A6%3A%22search%22%3Ba%3A5%3A%7Bs%3A3%3A%22nom%22%3Bi%3A1%3Bs%3A6%3A%22region%22%3Bi%3A1%3Bs%3A11%3A%22departement%22%3Bi%3A1%3Bs%3A5%3A%22ville%22%3Bi%3A1%3Bs%3A12%3A%22itemsPerPage%22%3Bi%3A1%3B%7D%7D86c9510d17c093c44d053714ab20567929a45f9d&tx_siteveterinaire_general%5Bsearch%5D%5Bnom%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bregion%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bdepartement%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bville%5D=&tx_siteveterinaire_general%5Bsearch%5D%5BitemsPerPage%5D=100&tx_siteveterinaire_general%5B%40widget_0%5D%5BcurrentPage%5D=127&cHash=8d8dc78e004b4b9d0ecfdf9b884f54ca"
rows <- read_html(url) %>%
  html_nodes("div.search-result .line")

strip <- function(x) gsub("[\t\n]", "", x)

i <- 1
data = list()

for(r in rows){
  addresses = list()
  j <- 1
  locations = r %>% html_nodes("div:nth-child(2) p")
  for(loc in locations){
    addresses[[j]] <- loc %>% html_text() %>%
      gsub("[\t]", "", .) %>%              #remove tabs
      gsub('([\n])\\1+', '\\1', .) %>%     #remove duplicate \n
      gsub('^\n|\n$', '', .) %>%           #remove 1st and last \n
      strsplit(., split='\n', fixed=TRUE)  #split by \n
    j <- j + 1
  }
  data[[i]] <- c(
    name = r %>% html_nodes("div:first-child h3") %>% html_text() %>% strip(.),
    ordinal = r %>% html_nodes("div:first-child p") %>% html_text() %>% strip(.),
    addresses = addresses
  )
  i <- i + 1
}
df = rbind.fill(lapply(data, function(y){ as.data.frame(t(y), stringsAsFactors = FALSE) }))

#show data
print(df)

for(i in 1:3){
  print(paste("name", df[i, "name"]))
  print(paste("ordinal", df[i, "ordinal"]))
  print(paste("addresses", df[i, "addresses"]))
  print(paste("addresses1", df[i, "addresses1"]))
  print(paste("addresses2", df[i, "addresses2"]))
  print(paste("addresses3", df[i, "addresses3"]))
}

Read HTML Table Into Data Frame with Hyperlinks in R

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.
The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.
The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.
Here's what I have so far:
library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )
However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?
I find using the rvest package easier than XML.
Here is a solution to obtain a list of the links:
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
library(rvest)
page <- read_html(webURL)
links <- page %>% html_nodes("a") %>% html_attr("href")
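If only the download links are needed, that vector can be filtered; the mirDownload pattern is an assumption based on the URLs shown in the next answer:
# Keep only the hrefs that point at the report downloads
zip_links <- grep("mirDownload", links, value = TRUE)
head(zip_links)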
This code will let you target either the XML files or the CSV files and you get the filename as well as the URL so you can then iterate over the URLs and filenames and save them with names you'll recognize later on.
library(rvest)
library(dplyr)
pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")
csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")
data_frame(
  fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df
glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...
xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")
data_frame(
  fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
  url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df
glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...
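To actually save the files, you can then walk over the two columns; a minimal sketch of that loop (the http://mis.ercot.com prefix for the relative hrefs is an assumption, and the Sys.sleep is just to be polite to the server):
library(purrr)

base_url <- "http://mis.ercot.com"  # assumed host for the relative download paths

walk2(csv_df$url, csv_df$fil_name, function(u, f) {
  download.file(paste0(base_url, u), destfile = f, mode = "wb")
  Sys.sleep(1)
})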

rvest cannot find node with xpath

This is the website I scrape:
ppp projects
I want to use xpath to select the node like below
The xpath I get by using inspect element is "//*[@id="pppListUl"]/li[1]/div[2]/span[2]/span"
My scripts are like below:
a <- html("http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/toPPPList.do")
b <- html_nodes(a, xpath = '//*[#id="pppListUl"]/li[1]/div[2]/span[2]/span')
b
Then I got the result
{xml_nodeset (0)}
Then I checked the page source, and I didn't even find anything about the project I selected.
I was wondering why I cannot find it in the page source and, in turn, how I can get the node with rvest.
It makes an XHR request for the content. Just work with that data (it's pretty clean):
library(httr)
POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
     encode="form",
     body=list(queryPage=1,
               distStr="",
               induStr="",
               investStr="",
               projName="",
               sortby="",
               orderby="",
               stageArr="")) -> res
content(res, as="text") %>%
  jsonlite::fromJSON(flatten=TRUE) %>%
  dplyr::glimpse()
(StackOverflow isn't advanced enough to let me post the output of that as it thinks it's spam).
It's a 4 element list with fields totalCount, list (which has the actual data), currentPage and totalPage.
It looks like you can change the queryPage form variable to iterate through the pages to get the whole list/database, something like:
library(httr)
library(purrr)
library(dplyr)
get_page <- function(page_num=1, .pb=NULL) {
  if (!is.null(.pb)) .pb$tick()$print()
  POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
       encode="form",
       body=list(queryPage=page_num,
                 distStr="",
                 induStr="",
                 investStr="",
                 projName="",
                 sortby="",
                 orderby="",
                 stageArr="")) -> res
  content(res, as="text") %>%
    jsonlite::fromJSON(flatten=TRUE) -> dat
  dat$list
}
n <- 5 # change this to the value in `totalPage`
pb <- progress_estimated(n)
df <- map_df(1:n, get_page, pb)
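Rather than hard-coding n, the totalPage field from the 4-element list described above could be read from a first request; a rough sketch:
first_res <- POST('http://www.cpppc.org:8082/efmisweb/ppp/projectLivrary/getPPPList.do?tokenid=null',
                  encode="form",
                  body=list(queryPage=1, distStr="", induStr="", investStr="",
                            projName="", sortby="", orderby="", stageArr=""))
n <- content(first_res, as="text") %>%
  jsonlite::fromJSON(flatten=TRUE) %>%
  .$totalPage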