How to click links onto the next page using RCurl?

I am trying to scrape this table from this website using RCurl. I am able to do this and put it into a nice dataframe using the code:
library(RCurl)
library(XML)
clinVar <- getURL("http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1")
docForm2 <- htmlTreeParse(clinVar, useInternalNodes = TRUE)
# Select the rows of the results table by its class attribute
xp_expr <- "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes <- getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info", "Gene", "Variation", "Freq", "Phenotype", "Clinical significance", "Status", "Chr", "Location")
However, I can only extract the data on the first page, and the table spans multiple pages. How do you access data on the next page? I have looked at the HTML code for the website and the region that the "Next" button exists in is here (I believe!):
<a name="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page">Next ></a>
I would like to know how to access this link using getURL, postForm etc. I think I should be doing something like this, to get data from the second page but it's still just giving me the first page:
url <- "http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1"
clinVar <- postForm(url,
"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.cPage" ="2")
docForm2 <- htmlTreeParse(clinVar, useInternalNodes = TRUE)
xp_expr <- "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq","Phenotype","Clinical significance","Status", "Chr","Location")
Thanks to anyone who can help.

I would use E-utilities to access data at NCBI instead.
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=brca1"
readLines(url)
[1] "<?xml version=\"1.0\" ?>"
[2] "<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD eSearchResult, 11 May 2002//EN\" \"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd\">"
[3] "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"
Pass the QueryKey and WebEnv to esummary and get the XML summary (this changes with each esearch, so copy and paste the new keys into the url below)
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&query_key=1&WebEnv=NCID_1_36649974_130.14.18.34_9001_1386348760_356908530"
brca1 <- xmlParse(url2)
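Rather than copying the keys by hand, the QueryKey and WebEnv values can be pulled out of the esearch response in code. The following is a minimal base-R sketch, assuming the response is available as a single string like the readLines() output above. The helper name extract_tag is hypothetical, and the regex only handles simple, non-nested tags; parsing the response with the XML package would be more robust.

```r
# Hypothetical helper: return the text inside a simple, non-nested XML tag.
extract_tag <- function(txt, tag) {
  pattern <- sprintf("<%s>([^<]*)</%s>", tag, tag)
  m <- regmatches(txt, regexec(pattern, txt))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Sample line copied from the esearch output above.
# In practice: esearch_txt <- paste(readLines(url), collapse = "")
esearch_txt <- "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"

query_key <- extract_tag(esearch_txt, "QueryKey")
web_env   <- extract_tag(esearch_txt, "WebEnv")

# Assemble the esummary URL from the extracted values
url2 <- paste0(
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar",
  "&query_key=", query_key,
  "&WebEnv=", web_env
)
print(url2)
```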
Next, view a single record and then extract the fields you need. You may need to loop through the set if there are 0 to many values assigned to a tag. Others like clinical significance description always have 1 value.
getNodeSet(brca1, "//DocumentSummary")[[1]]
table(xpathSApply(brca1, "//clinical_significance/description", xmlValue) )
                Benign  conflicting data from submitters         not provided        other
                   129                                22                    6            1
            Pathogenic           probably not pathogenic  probably pathogenic  risk factor
                   508                                68                   19           43
Uncertain significance
                   284
Also, there are many packages with E-utilities on github and BioC (rentrez, reutils, genomes and others). Using the genomes package on BioC, this simplifies to
brca1 <- esummary( esearch("brca1", db="clinvar"), parse=FALSE )

Using the e-utilities feature on the NCBI database, see http://www.ncbi.nlm.nih.gov/books/NBK25500/ for more details.
## use eSearch feature in eUtilities to search NCBI for ids corresponding to each row of data.
## note: to see all ids, not just the first 20 (the esearch default), set retmax to a high number
## to get query id and web env info, set usehistory=y
library(RCurl)
library(XML)
baseSearch <- ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=") ## eSearch
db <- "clinvar" ## database to query
gene <- "BRCA1" ## gene of interest
query <- paste('[gene]+AND+"','clinsig pathogenic"','[Properties]+AND+"','single nucleotide variant"','[Type of variation]&usehistory=y&retmax=1110',sep="") ## query, see below for details
baseFetch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=" ## base fetch
searchURL <- paste(baseSearch,db, "&term=",gene,query,sep="")
getSearch <- getURL(searchURL)
searchHTML <- htmlTreeParse(getSearch, asText = TRUE, useInternalNodes = TRUE) ## parse the response downloaded on the previous line
nodes <- getNodeSet(searchHTML,"//querykey") ## this name "querykey" was extracted from the HTML source code for this page
querykey <- xmlToDataFrame(nodes)
nodes <- getNodeSet(searchHTML,"//webenv") ## this name "webenv" was extracted from the HTML source code for this page
webenv <- xmlToDataFrame(nodes)
fetchURL <- paste(baseFetch,db,"&query_key=",querykey[[1]],"&WebEnv=",webenv[[1]],"&rettype=docsum",sep="")
getFetch <- getURL(fetchURL)
fetchHTML <- htmlTreeParse(getFetch, useInternalNodes =T)
nodes <- getNodeSet(fetchHTML, "//position")
extractedDataAll <- xmlToDataFrame(nodes)
colnames(extractedDataAll) <- c("pathogenicSNPs")
print(extractedDataAll)
Please note, I found the query information by going to http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1 selecting my filters (pathogenic, etc) and then clicking the advanced button. The most recent filters applied should come up in the main box, I used this for the query.
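As a hedged sketch of how such a query could be assembled from its parts rather than pasted together by hand: the helper name build_esearch_url is hypothetical, the field names are the ones used in the query above, and it is an assumption that the server treats the percent-encoded term the same way as the hand-written one.

```r
# Hypothetical helper: assemble an esearch URL from a gene symbol and named filters.
build_esearch_url <- function(gene, filters, db = "clinvar", retmax = 1110) {
  # Each filter becomes "value"[Field]; the gene itself is tagged with [gene]
  terms <- c(paste0(gene, "[gene]"),
             sprintf('"%s"[%s]', filters, names(filters)))
  term <- paste(terms, collapse = " AND ")
  paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
         "?db=", db,
         "&term=", URLencode(term, reserved = TRUE),  # percent-encode spaces, quotes, brackets
         "&usehistory=y",
         "&retmax=", retmax)
}

filters <- c("Properties" = "clinsig pathogenic",
             "Type of variation" = "single nucleotide variant")
searchURL <- build_esearch_url("BRCA1", filters)
print(searchURL)
```

Keeping the filters in a named vector makes it easy to add or drop a filter without touching the string handling.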

ClinVar now offers XML download of the whole database so webscraping is not necessary.

Related

web-scraping: web-scraped object doesn't match information on the website and crashes RStudio [closed]

I collected a series of URLs similar to this one. For each URL, I am using the rvest package to web-scrape information related to the address of every practitioner listed in each box of the webpage. By inspecting the HTML structure of the webpage, I could notice that the information I am trying to retrieve is present inside the HTML division called unit size1of2 (which appears, by hovering with the cursor, as div.unit.size1of2). Then, I used the following code to extract the information I need:
library(rvest)
library(xml2)
webpage <- read_html(x = "myURL")
webpage_name <- webpage %>%
  html_nodes("div.unit.size1of2") %>%
  html_text(trim = TRUE)
However, the result I get is very messy. First, it contains information I didn't want to scrape, some of which does not even seem to be present on the website. In addition, my RStudio IDE freezes for a while every time I try to print the result, and no command works properly afterwards. Finally, the result is not the one I was looking for.
Do you think this is due to some kind of protection present on the website?
Thank you for your help!
You can start by iterating over the rows, which can be selected using div.search-result .line, and then:
getting the name using div:first-child h3
getting the ordinal using div:first-child p
getting the locations by iterating over div:nth-child(2) p, since there can be multiple locations (one entry has 5 locations on your page), and storing them in a list
It's necessary to remove the tabs and newlines using gsub("[\t\n]", "", x) for the name and ordinal. For the addresses, you can get the text, remove the tabs, collapse duplicate newlines, strip the leading and trailing newline, and split on \n to get a list like:
[1] "CABINET VÉTÉRINAIRE DV FEYS JEAN-MARC"
[2] "Cabinet Veterinaire"
[3] "ZA de Kercadiou"
[4] "XXXXX"
[5] "LANVOLLON"
[6] "Tél : 0X.XX.XX.XX.XX"
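The address clean-up steps described above can be sketched in isolation with base R. The function name clean_address and the raw sample string are hypothetical; the sample only imitates the kind of whitespace html_text() might return for one location.

```r
# Clean one block of address text: drop tabs, collapse runs of newlines,
# trim a leading/trailing newline, then split into one element per line.
clean_address <- function(x) {
  x <- gsub("\t", "", x, fixed = TRUE)
  x <- gsub("\n+", "\n", x)
  x <- gsub("^\n|\n$", "", x)
  strsplit(x, "\n", fixed = TRUE)[[1]]
}

# Hypothetical raw text with tabs and repeated newlines
raw <- "\n\t\tCabinet Veterinaire\n\n\t\tZA de Kercadiou\n\t\tLANVOLLON\n\n"
print(clean_address(raw))
```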
The following code also converts the list of vectors to a dataframe with all the data on that page :
library(rvest)
library(plyr)
url = "https://www.veterinaire.fr/annuaires/trouver-un-veterinaire-pour-soigner-mon-animal.html?tx_siteveterinaire_general%5B__referrer%5D%5B%40extension%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40vendor%5D=SiteVeterinaire&tx_siteveterinaire_general%5B__referrer%5D%5B%40controller%5D=FrontendUser&tx_siteveterinaire_general%5B__referrer%5D%5B%40action%5D=search&tx_siteveterinaire_general%5B__referrer%5D%5Barguments%5D=YToxOntzOjY6InNlYXJjaCI7YTo1OntzOjM6Im5vbSI7czowOiIiO3M6NjoicmVnaW9uIjtzOjA6IiI7czoxMToiZGVwYXJ0ZW1lbnQiO3M6MDoiIjtzOjU6InZpbGxlIjtzOjA6IiI7czoxMjoiaXRlbXNQZXJQYWdlIjtzOjI6IjEwIjt9fQ%3D%3D21a1899f9a133814dfc1eb4e01b3b47913bd9925&tx_siteveterinaire_general%5B__referrer%5D%5B%40request%5D=a%3A4%3A%7Bs%3A10%3A%22%40extension%22%3Bs%3A15%3A%22SiteVeterinaire%22%3Bs%3A11%3A%22%40controller%22%3Bs%3A12%3A%22FrontendUser%22%3Bs%3A7%3A%22%40action%22%3Bs%3A6%3A%22search%22%3Bs%3A7%3A%22%40vendor%22%3Bs%3A15%3A%22SiteVeterinaire%22%3B%7D7cd75ca141359a98763248c24da8103293a53d08&tx_siteveterinaire_general%5B__trustedProperties%5D=a%3A1%3A%7Bs%3A6%3A%22search%22%3Ba%3A5%3A%7Bs%3A3%3A%22nom%22%3Bi%3A1%3Bs%3A6%3A%22region%22%3Bi%3A1%3Bs%3A11%3A%22departement%22%3Bi%3A1%3Bs%3A5%3A%22ville%22%3Bi%3A1%3Bs%3A12%3A%22itemsPerPage%22%3Bi%3A1%3B%7D%7D86c9510d17c093c44d053714ab20567929a45f9d&tx_siteveterinaire_general%5Bsearch%5D%5Bnom%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bregion%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bdepartement%5D=&tx_siteveterinaire_general%5Bsearch%5D%5Bville%5D=&tx_siteveterinaire_general%5Bsearch%5D%5BitemsPerPage%5D=100&tx_siteveterinaire_general%5B%40widget_0%5D%5BcurrentPage%5D=127&cHash=8d8dc78e004b4b9d0ecfdf9b884f54ca"
rows <- read_html(url) %>%
  html_nodes("div.search-result .line")
strip <- function (x) gsub("[\t\n]", "", x)
i <- 1
data = list()
for(r in rows){
  addresses = list()
  j <- 1
  locations = r %>% html_nodes("div:nth-child(2) p")
  for(loc in locations){
    addresses[[j]] <- loc %>% html_text() %>%
      gsub("[\t]", "", .) %>% #remove tabs
      gsub('([\n])\\1+', '\\1', .) %>% #remove duplicate \n
      gsub('^\n|\n$', '', .) %>% #remove 1st and last \n
      strsplit(., split='\n', fixed=TRUE) #split by \n
    j <- j + 1
  }
  data[[i]] <- c(
    name = r %>% html_nodes("div:first-child h3") %>% html_text() %>% strip(.),
    ordinal = r %>% html_nodes("div:first-child p") %>% html_text() %>% strip(.),
    addresses = addresses
  )
  i <- i + 1
}
df = rbind.fill(lapply(data,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
#show data
print(df)
for(i in 1:3){
  print(paste("name",df[i,"name"]))
  print(paste("ordinal",df[i,"ordinal"]))
  print(paste("addresses",df[i,"addresses"]))
  print(paste("addresses1",df[i,"addresses1"]))
  print(paste("addresses2",df[i,"addresses2"]))
  print(paste("addresses3",df[i,"addresses3"]))
}

Trying to find hyperlinks by scraping

So I am fairly new to the topic of webscraping. I am trying to find all the hyperlinks that the html code of the following page contains:
https://www.exito.com/mercado/lacteos-huevos-y-refrigerados/leches
So this is what I tried:
url <- "https://www.exito.com/mercado/lacteos-huevos-y-refrigerados/leches"
webpage <- read_html(url)
html_attr(html_nodes(webpage, "a"), "href")
The result only contains about 6 links, but just by viewing the page you can see that there are many more hyperlinks.
For example the code behind the first image has something like: <a href="/leche-entera-sixpack-en-bolsa-x-11-litros-cu-807650/p" class="vtex-product-summary-2-x-clearLink h-100 flex flex-column"> ...
What am I doing wrong?
You won't be able to get the a tags you're after because that part of the website is not visible to html/xml parsers. This is because it's a dynamic part of the website that changes if you choose another part of the website; the only 'static' part of the website is the top header, which is why you only got 6 a tags: the six a tags from the header.
For this, we need to mimic the behavior of a browser (firefox, chrome, etc...), go into the website (note that we're not entering the website as an html/xml parser but as a 'user' through a browser) and read the html/xml source code from there.
For this we'll need the R package RSelenium. Make sure you install it correctly together with docker, as none of the code below can work without it.
After you install RSelenium and docker, run docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1 from your terminal (on Linux, you can run this in the terminal; on Windows you'll have to download a docker terminal and run it there). After that you're all set to reproduce the code below.
Why your approach didn't work
We need to access the 5th div tag from the image below:
As you can see, this 5th div tag has three dots (...) inside, denoting that there's code inside: this is precisely where all of the bottom part of the website is (including the a tags that you're after). If we try to access this 5th tag using rvest or xml2, we won't find anything:
library(xml2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
lnk <- "https://www.exito.com/mercado/lacteos-huevos-y-refrigerados/leches?page=2"
# Note how the 5th div element is empty and it should contain the lower
# part of the website
lnk %>%
  read_html() %>%
  xml_find_all("//div[@class='flex flex-grow-1 w-100 flex-column']") %>%
  xml_children()
#> {xml_nodeset (6)}
#> [1] <div class=""></div>\n
#> [2] <div class=""></div>\n
#> [3] <div class=""></div>\n
#> [4] <div class=""></div>\n
#> [5] <div class=""></div>\n
#> [6] <div class=""></div>
Note how the 5th div tag doesn't have any code inside. A simple html/xml parser won't catch it.
How it can work
We need to use RSelenium. After you've installed everything correctly, we need to setup a 'remote driver', open it and navigate to the website. All of these steps are just to make sure that we're coming into the website as a 'normal' user from a browser. This will make sure that we can access the rendered code that we actually see when we enter the website. Below are the detailed steps from entering the website and constructing the links.
# Make sure you install docker correctly: https://docs.ropensci.org/RSelenium/articles/docker.html
library(RSelenium)
# After installing docker and before running the code, make sure you run
# the rselenium docker image: docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1
# Now, set up your remote driver
remDr <- remoteDriver(
remoteServerAddr = "localhost",
port = 4445L,
browserName = "firefox"
)
# Initiate the driver
remDr$open(silent = TRUE)
# Navigate to the exito.com website
remDr$navigate(lnk)
prod_links <-
  # Get the html source code
  remDr$getPageSource()[[1]] %>%
  read_html() %>%
  # Find all a tags which have a certain class
  # I searched for this tag manually on the website code and saw that all products
  # had an a tag that shared the same class
  xml_find_all("//a[@class='vtex-product-summary-2-x-clearLink h-100 flex flex-column']") %>%
  # Extract the href attribute
  xml_attr("href") %>%
  paste0("https://www.exito.com", .)
prod_links
#> [1] "https://www.exito.com/leche-semidescremada-deslactosada-en-bolsa-x-900-ml-145711/p"
#> [2] "https://www.exito.com/leche-entera-en-bolsa-x-900-ml-145704/p"
#> [3] "https://www.exito.com/leche-entera-sixpack-x-1300-ml-cu-987433/p"
#> [4] "https://www.exito.com/leche-deslactosada-en-caja-x-1-litro-878473/p"
#> [5] "https://www.exito.com/leche-polvo-deslactos-semidesc-764522/p"
#> [6] "https://www.exito.com/leche-slight-sixpack-en-caja-x-1050-ml-cu-663528/p"
#> [7] "https://www.exito.com/leche-semidescremada-sixpack-en-caja-x-1050-ml-cu-663526/p"
#> [8] "https://www.exito.com/leche-descremada-sixpack-x-1300-ml-cu-563046/p"
#> [9] "https://www.exito.com/of-leche-deslact-pag-5-lleve-6-439057/p"
#> [10] "https://www.exito.com/sixpack-de-leche-descremada-x-1100-ml-cu-414454/p"
#> [11] "https://www.exito.com/leche-en-polvo-klim-fortificada-360g-239085/p"
#> [12] "https://www.exito.com/leche-deslactosada-descremada-en-caja-x-1-litro-238291/p"
#> [13] "https://www.exito.com/leche-deslactosada-en-caja-x-1-litro-157334/p"
#> [14] "https://www.exito.com/leche-entera-larga-vida-en-caja-x-1-litro-157332/p"
#> [15] "https://www.exito.com/leche-en-polvo-klim-fortificada-780g-138121/p"
#> [16] "https://www.exito.com/leche-entera-en-bolsa-x-1-litro-125079/p"
#> [17] "https://www.exito.com/leche-entera-en-bolsa-sixpack-x-11-litros-cu-59651/p"
#> [18] "https://www.exito.com/leche-deslactosada-descremada-sixpack-x-11-litros-cu-22049/p"
#> [19] "https://www.exito.com/leche-entera-en-polvo-instantanea-x-760-gr-835923/p"
#> [20] "https://www.exito.com/of-alpin-cja-cho-pag9-llev12/p"
Hope this answers your questions
The data, including the urls, are returned dynamically from a GraphQL query you can observe in the network tab when clicking Mostrar más on the page. This is why the content is not present in your initial query - it has not yet been requested.
XHR for the product info
The relevant XHR in the network tab of dev tools:
The actual query params of the url query string:
You can do away with most of the request info. What you do need is the extensions param. More specifically, you need to provide the sha256Hash and the base64 encoded string value associated with the variables key in the persistedQuery.
The SHA256 Hash
The appropriate hash can be extracted from at least one of the js files which essentially governs the set up. An example file you can use is:
https://exitocol.vtexassets.com/_v/public/assets/v1/published/bundle/public/react/asset.min.js?v=1&files=vtex.store-resources@0.38.0,OrderFormContext,Mutations,Queries,PWAContext&files=exitocol.store-components@0.0.2,common,11,3,SearchBar&files=vtex.responsive-values@0.2.0,common,useResponsiveValues&files=vtex.slider@0.7.3,common,0,Dots,Slide,Slider,SliderContainer&files=exito.components@4.0.7,common,0,1,3,4&workspace=master.
The query hash can be regex'd from the response text of an xhr request to this uri. The regex is explained here and the first match is sufficient:
To apply this in R with stringr, you will need some extra escapes, e.g. \s becomes \\s.
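To illustrate the escaping point with base R only: the js_text fragment below is a made-up stand-in for the minified JS file (its real layout is an assumption here), and the hash value in it is a placeholder.

```r
# Hypothetical fragment standing in for the minified JS; the real file is far larger.
js_text <- 'foo query productSearch($q:String){...} bar hash: "abc123def" baz hash: "zzz"'

# In an R string literal every regex backslash is doubled: \s is written "\\s"
pattern <- 'query\\s+productSearch.*?hash:\\s+"(.*?)"'

# perl = TRUE selects PCRE, which supports the lazy quantifiers (.*?)
m <- regmatches(js_text, regexec(pattern, js_text, perl = TRUE))[[1]]
sha <- m[2]  # element 1 is the whole match, element 2 the capture group
print(sha)
```

Because the quantifiers are lazy, the capture stops at the first closing quote after the first hash: that follows productSearch, which is the "first match is sufficient" behaviour described above.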
The Base64 encoded product query
The base64 encoded string you can generate yourself with the appropriate library e.g. it seems there is a base64encode R function in caTools package.
The encoded string looks like (depending on page/result batch):
eyJ3aXRoRmFjZXRzIjpmYWxzZSwiaGlkZVVuYXZhaWxhYmxlSXRlbXMiOmZhbHNlLCJza3VzRmlsdGVyIjoiQUxMX0FWQUlMQUJMRSIsInF1ZXJ5IjoiMTQ4IiwibWFwIjoicHJvZHVjdENsdXN0ZXJJZHMiLCJvcmRlckJ5IjoiT3JkZXJCeVRvcFNhbGVERVNDIiwiZnJvbSI6MjAsInRvIjozOX0=
Decoded:
{"withFacets":false,"hideUnavailableItems":false,"skusFilter":"ALL_AVAILABLE","query":"148","map":"productClusterIds","orderBy":"OrderByTopSaleDESC","from":20,"to":39}
The from and to params are the offsets for the results batches of products which come in batches of twenty. So, you can write functions which return the appropriate sha256 hash and send a subsequent request for product info where you base64 encode, with the appropriate library, the string above and alter the from and to params as required. Potentially others as well (have a play!).
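The batch arithmetic can be sketched as follows, assuming batches of twenty as described above. The function names batch_offsets and variables_json are hypothetical, and the JSON fields are copied from the decoded string shown earlier; the result would still need to be base64 encoded (for example with caTools::base64encode) before being sent.

```r
# Offsets for a 1-based batch number, assuming batches of 20 products
batch_offsets <- function(batch, size = 20L) {
  from <- (batch - 1L) * size
  c(from = from, to = from + size - 1L)
}

# Build the JSON for the "variables" value; fields copied from the decoded string above
variables_json <- function(from, to) {
  sprintf('{"withFacets":false,"hideUnavailableItems":false,"skusFilter":"ALL_AVAILABLE","query":"148","map":"productClusterIds","orderBy":"OrderByTopSaleDESC","from":%d,"to":%d}', from, to)
}

off <- batch_offsets(2L)
json <- variables_json(off[["from"]], off[["to"]])
print(json)
```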
The xhr response:
The response is json so you might need a json library (e.g. jsonlite) to handle the result (UPDATE: Seems you don't with R and httr). You can extract the links from a list of dictionaries nested within result['data']['products'], as per Python example, where result is the json object retrieved from the xhr with from and to params.
Examples:
Examples using R and Python are shown below (N.B. I am less familiar with R). The above has been kept fairly language agnostic.
Bear in mind, whilst I am extracting the urls, the json returned has a lot more info including product title, price, image info etc.
Example output:
TODO:
Add in error handling
Use Session objects to benefit from re-use of underlying tcp connection especially if making multiple requests to get all products
Add in functionality to return total product number and loop structure to retrieve all (Python example might benefit from decorator)
R (a quick first go):
library(purrr)
library(stringr)
library(caTools)
library(httr)
get_links <- function(sha, start, end){
  # JSON for the "variables" value; from/to select the batch of products
  string = paste0('{"withFacets":false,"hideUnavailableItems":false,"skusFilter":"ALL_AVAILABLE","query":"148","map":"productClusterIds","orderBy":"OrderByTopSaleDESC","from":' , start , ',"to":' , end , '}')
  base64encoded <- caTools::base64encode(string)
  params = list(
    'extensions' = paste0('{"persistedQuery":{"version":1,"sha256Hash":"' , sha , '","sender":"vtex.store-resources@0.x","provider":"vtex.search-graphql@0.x"},"variables":"' , base64encoded , '"}')
  )
  product_info <- content(httr::GET(url = 'https://www.exito.com/_v/segment/graphql/v1', query = params))$data$products
  links <- map(product_info, ~{
    .x %>% .$link
  })
  return(links)
}
start <- '0'
end <- '19'
# Fetch the JS bundle and regex out the sha256 hash of the productSearch query
sha <- httr::GET('https://exitocol.vtexassets.com/_v/public/assets/v1/published/bundle/public/react/asset.min.js?v=1&files=vtex.store-resources@0.38.0,OrderFormContext,Mutations,Queries,PWAContext&files=exitocol.store-components@0.0.2,common,11,3,SearchBar&files=vtex.responsive-values@0.2.0,common,useResponsiveValues&files=vtex.slider@0.7.3,common,0,Dots,Slide,Slider,SliderContainer&files=exito.components@4.0.7,common,0,1,3,4&workspace=master') %>%
  content(., as = "text") %>% str_match(., 'query\\s+productSearch.*?hash:\\s+"(.*?)"') %>% .[[2]]
links <- get_links(sha, start, end)
print(links)
Py:
import requests, base64, re, json
def get_sha():
    r = requests.get('https://exitocol.vtexassets.com/_v/public/assets/v1/published/bundle/public/react/asset.min.js?v=1&files=vtex.store-resources@0.38.0,OrderFormContext,Mutations,Queries,PWAContext&files=exitocol.store-components@0.0.2,common,11,3,SearchBar&files=vtex.responsive-values@0.2.0,common,useResponsiveValues&files=vtex.slider@0.7.3,common,0,Dots,Slide,Slider,SliderContainer&files=exito.components@4.0.7,common,0,1,3,4&workspace=master')
    p = re.compile(r'query\s+productSearch.*?hash:\s+"(.*?)"') #https://regex101.com/r/VdC27H/5
    sha = p.findall(r.text)[0]
    return sha

def get_json(sha, start, end):
    #these 'from' and 'to' values correspond with page # as pages cover batches of 20 e.g. start 20 end 39
    string = '{"withFacets":false,"hideUnavailableItems":false,"skusFilter":"ALL_AVAILABLE","query":"148","map":"productClusterIds","orderBy":"OrderByTopSaleDESC","from":' + start + ',"to":' + end + '}'
    base64encoded = base64.b64encode(string.encode('utf-8')).decode()
    params = (('extensions', '{"persistedQuery":{"sha256Hash":"' + sha + '","sender":"vtex.store-resources@0.x","provider":"vtex.search-graphql@0.x"},"variables":"' + base64encoded + '"}'),)
    r = requests.get('https://www.exito.com/_v/segment/graphql/v1',params=params)
    return r.json()

def get_links(sha, start, end):
    result = get_json(sha, start, end)
    links = [i['link'] for i in result['data']['products']]
    return links

sha = get_sha()
links = get_links(sha, '0', '19')
#print(len(links))
print(links)

How to follow a link with data-params using rvest

I'm trying to web scrape a public data provider but I got stuck when I had to click on a button passing a parameter to the JS. Here's my attempt:
require(rvest)
url <- 'https://myterna.terna.it/SunSet/Public/'
page <- url %>% read_html()
node_link <- page %>% html_node('.sub-item:nth-child(1) .postlink')
In node_link I can easily find the target page as the href of this HTML tag:
<a href="/SunSet/Public/Pubblicazioni"
class="postlink"
data-params="filter.IdSezione=52767620567B3077E053A8829B0A9478">
The point is that I cannot easily retrieve the content of the linked page because there are other buttons that point to the same link. The only difference between the various buttons is the data-params attribute which probably has to be given to the JS in order to retrieve the specific content.
Any ideas on how to solve the issue?
Obligatory heads-up:
It's not really clear whether the site allows scraping. The Legal Notice says: "Authorization is granted for the reproduction of documents published on this website exclusively for personal use and not for commercial purposes, provided the name of source is properly indicated."
Use this while respecting their terms of service.
Inspecting the network activity when clicking on that link, we can see the webpage makes a POST request to https://myterna.terna.it/SunSet/Public/Pubblicazioni/List. We can find both the requested headers and the params sent.
par <- '{"draw":1,"columns":[{"data":0,"name":"","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":1,"name":"","searchable":true,"orderable":true,"search":{"value":"","regex":false}},{"data":2,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":3,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":4,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":5,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":6,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}},{"data":7,"name":"","searchable":false,"orderable":false,"search":{"value":"","regex":false}}],"order":[],"start":0,"length":10,"search":{"value":"","regex":false},"filter":{"IdSezione":"52767620567B3077E053A8829B0A9478","Titolo":"","Id":"","ExtKey":"","TipoPubblicazione":"","SheetName":"","Anno":"2017","Mese":"7","Giorno":"","DataPubblicazione":"","TipoDatoPubblicazione":""},"details":{}}'
This is JSON, so we can parse it and change its values if we want (although I tried a few different filters and it does not respond much)
par <- jsonlite::fromJSON(par)
par$filter$Mese <- '7'
As for headers, only X-Requested-With: XMLHttpRequest is really needed, so we can cut it down to that.
library(httr)
response <- POST('https://myterna.terna.it/SunSet/Public/Pubblicazioni/List',
                 add_headers('X-Requested-With' = 'XMLHttpRequest'),
                 body = par,
                 encode = 'json')
json_data <- content(response)$data
This returns a list, that we can safely transform to a dataframe for convenient use:
df <- data.frame(matrix(unlist(json_data), nrow=length(json_data), byrow=TRUE))
head(df, 2)
#> X1
#> 1 SbilanciamentoAggregatoZonale_SegnoGiornaliero_Orario_20170709
#> 2 SbilanciamentoAggregatoZonale_SegnoGiornaliero_QuartoOrario_20170709
#> X2
#> 1 /Date(1499680800000)/
#> 2 /Date(1499680800000)/
#> X3
#> 1 <div class="actions detail-inline export" data-pk="53F4A57FCB70304EE0532A889B0A7758"></div>
#> 2 <div class="actions detail-inline export" data-pk="53F4A57FCB6D304EE0532A889B0A7758"></div>
#> X4 X5 X6
#> 1 53F4A57FCB70304EE0532A889B0A7758 25 SEGNO_MACROZONALE_ORARIO
#> 2 53F4A57FCB6D304EE0532A889B0A7758 25 SEGNO_MACROZONALE_QUARTO_ORARIO
#> X7 X8
#> 1 Segno Giornaliero Orario
#> 2 Segno Giornaliero Quarto Orario
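The X2 column holds dates as strings such as /Date(1499680800000)/. A minimal base-R sketch for converting them, assuming the number is a non-negative count of milliseconds since 1970-01-01 UTC; the function name parse_ms_date is hypothetical.

```r
# Convert strings like "/Date(1499680800000)/" to POSIXct.
# Assumes the digits are milliseconds since 1970-01-01 UTC (non-negative).
parse_ms_date <- function(x, tz = "UTC") {
  ms <- as.numeric(gsub("[^0-9]", "", x))  # keep only the digits
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = tz)
}

dates <- parse_ms_date(c("/Date(1499680800000)/", "/Date(1499680800000)/"))
print(format(dates, "%Y-%m-%d %H:%M:%S"))
```

Passing a different tz (for example "Europe/Rome") only changes how the same instant is displayed.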
OK, basically I was missing an understanding of how HTTP works. After some days of study I understood that the correct approach is to use the httr package as shown below.
First of all I retrieved all the settings needed from the public page:
lnkd_url <- paste0(dirname(dirname(url)),
                   node_link %>%
                     html_attr('href'))
# node_link is the <a> node selected in the question; it carries the data-params attribute
lnkd_id <- strsplit(node_link %>%
                      html_attr('data-params'), '=')[[1]][2]
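The data-params handling can also be sketched as a small helper that returns a named list ready to be used as a form body. The name parse_data_params is hypothetical, and support for several pairs joined by "&" is an assumption; the attribute shown in the question contains a single pair.

```r
# Hypothetical helper: turn "name=value" (optionally several pairs joined by "&")
# into a named list.
parse_data_params <- function(x) {
  pairs <- strsplit(strsplit(x, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- lapply(pairs, function(p) p[2])
  names(values) <- vapply(pairs, function(p) p[1], character(1))
  values
}

# Attribute value copied from the <a> tag in the question
body <- parse_data_params("filter.IdSezione=52767620567B3077E053A8829B0A9478")
str(body)
```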
Then it is possible to launch the POST request to the target page:
library(httr)
lnkd_page <- POST(lnkd_url,
                  body = list('filter.IdSezione' = lnkd_id))
That's it!

How to read a <li> table in a webpage

I have debugged the program many times trying to get the following result:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no matter what parameters I use, the result is not correct. The content of this table is part of the basis for my further analysis, and I am very troubled by it. I look forward to your help.
## download community-list ---the 1st level of IR Grid
#loading webpage and analyzing
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb<-readHTMLTable(tableNodes[[8]], header=TRUE)
# get External links
xpath <- "//a/@href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)
It is unclear exactly what you want your end result to look like, but if you modify your XPath statements a bit to take advantage of the DOM structure you can get something like this:
library(XML)
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[@class="communityLink"]//@href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[@class="communityLink"]//a', xmlValue)
table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data
console output causes stackoverflow to think this answer is spam but here is a screen shot:

Persisting HTML documents to disk

I am trying to save about 300 HTML objects to disk using R.
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
xml2::write_xml(read_html_test1, "testwrite.html")
read_html <- xml2::read_html("testwrite.html")
But this will eventually save about 300 separate files to disk. Ideally, what I would like is to save a single R object to disk that contains these 300 documents.
Converting each document to text before saving does not work for some reason. For example, the following will produce a weird (unhelpful) error:
str_html <- as.character(read_html_test1)
xml2::read_html(str_html)
If I try to use the output of xml2::read_html() directly, it is a pointer to a C structure and therefore will not persist to disk.
Any suggestions for a hack to make this work...?
I managed it with the httr package, whose content function can take an as = "text" argument, which stops it from parsing the HTML.
library(xml2)
library(httr)
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
# use `GET` to make the request, and pull out the html with `content`; returns text string
x <- content(GET(str_url), as = 'text')
# make a list of html documents to save
list_xs <- list(x, x)
# save list with `saveRDS`
saveRDS(list_xs, 'test.rds')
Now to see if it works:
# read in rds file we saved
saved_html <- readRDS('test.rds')
# parse the second element in it with `xml2::read_html`
saved_x_parsed <- read_html(saved_html[[2]])
# and let's see...
saved_x_parsed
# {xml_document}
# <html>
# [1] <head><title>\n\tNew Zealand holiday homes, baches and vacation homes for rent.\ ...
# [2] <body id="ctl00_Body" class="Page-List">\n <div class="SatNavBarPlaceholder"/>&#13 ...
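For the roughly 300 pages in the question, the same idea extends to a named list keyed by URL. The sketch below runs offline: the page strings are hypothetical stand-ins for what content(GET(u), as = "text") would return, and the file goes to a temporary path.

```r
# Hypothetical stand-ins for downloaded pages; in practice each string would
# come from content(GET(u), as = "text") as in the answer above.
urls <- sprintf("https://www.holidayhouses.co.nz/Browse/List.aspx?page=%d", 1:3)
pages <- lapply(seq_along(urls), function(i) {
  sprintf("<html><body><p>page %d</p></body></html>", i)
})
names(pages) <- urls  # key each document by the URL it came from

# One file on disk holds all documents
rds_path <- tempfile(fileext = ".rds")
saveRDS(pages, rds_path)

restored <- readRDS(rds_path)
print(identical(restored, pages))
print(names(restored))
```

Each restored string can then be parsed on demand with xml2::read_html(), so only plain text, not external pointers, is ever written to disk.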
How to save R objects to disk:
Save R Objects
I took your example code and produced working, human readable, R-loadable output as follows:
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
str_html <- as.character(read_html_test1)
x <- xml2::read_html(str_html)
save(x, file="c:\\temp\\text.txt",compress=FALSE,ascii=TRUE)