Get nodes from an HTML webpage to crawl URLs using R - html

I am trying to get the URLs under the node '.2lines' from the webpage 'https://www.sgcarmart.com/main/index.php':

library(rvest)

url <- read_html('https://www.sgcarmart.com/main/index.php') %>%
  html_nodes('.2lines') %>%
  html_attr('href')
I receive an error from the html_nodes() function:
Error in parse_simple_selector(stream) :
Expected selector, got <NUMBER '.2' at 1>
How do I get around this error?

You can use an XPath selector to find the nodes you want. The CSS parser rejects '.2lines' because a CSS class selector cannot start with a digit; XPath has no such restriction. The links are actually contained in <a> tags within the <p> tags you are trying to reference by class, so you can access them in a single XPath:
library(rvest)

site <- 'https://www.sgcarmart.com'

urls <- site %>%
  paste0("/main/index.php") %>%
  read_html() %>%
  html_nodes(xpath = "//*[@class = '2lines']/a") %>%
  html_attr("href") %>%
  {paste0(site, .)}

urls
#> [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#> [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#> [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#> [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#> [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#> [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#> [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#> [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#> [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
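If you prefer to stay with CSS, an attribute selector sidesteps the leading-digit restriction that breaks '.2lines'. A minimal, untested sketch; note that [class='2lines'] matches the class attribute verbatim, so it assumes '2lines' is the element's entire class attribute:

library(rvest)

# match the class attribute literally, then take the child <a> links
read_html('https://www.sgcarmart.com/main/index.php') %>%
  html_nodes("p[class='2lines'] a") %>%
  html_attr("href")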

Related

No connection in R

I've been trying to learn web scraping from an online course, and they give the following as an example:

library(rvest)

url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website <- read_html(url)

treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
treaties_links <- treaties_links[23:30]
treaties_links_full <- lapply(treaties_links, function(x) paste("https://www.canada.ca", x, sep = ""))
treaties_links_full[8] <- treaties_links[8]
treaty_texts <- lapply(treaties_links_full, function(x) read_html(x))
When I get to this last line, it returns an error:
Error in open.connection(x, "rb") :
Could not resolve host: www.canada.cahttp
Your error is in your lapply() code. If you print treaties_links, you will see that they are not all internal links (i.e. links starting with /); some are links to other domains:
print(treaties_links)
[1] "/en/employment-social-development/services/labour-relations/international/agreements/chile.html"
[2] "/en/employment-social-development/services/labour-relations/international/agreements/costa-rica.html"
[3] "/en/employment-social-development/services/labour-relations/international/agreements/peru.html"
[4] "/en/employment-social-development/services/labour-relations/international/agreements/colombia.html"
[5] "/en/employment-social-development/services/labour-relations/international/agreements/jordan.html"
[6] "/en/employment-social-development/services/labour-relations/international/agreements/panama.html"
[7] "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
[8] "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"
This means that when you run paste("https://www.canada.ca", x, sep = "") on e.g. link 7, you get:
"https://www.canada.cahttp://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
Assuming you want to keep that link, you might change your lapply() to:

treaties_links_full <- lapply(
  treaties_links,
  function(x) {
    ifelse(
      substr(x, 1, 1) == "/",
      paste("https://www.canada.ca", x, sep = ""),
      x
    )
  }
)
This will only prepend "https://www.canada.ca" to the links within that domain.
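As an aside, xml2 (which rvest is built on) ships a helper that does this resolution for you; a minimal sketch using the same treaties_links vector as above:

library(xml2)

# url_absolute() resolves relative links against a base URL and leaves
# already-absolute links (like links 7 and 8 above) untouched
treaties_links_full <- url_absolute(treaties_links, "https://www.canada.ca")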

How can I select only the links when I bring up the entire class in R?

Libraries:
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
When I use this code, I bring up the entire class node set from the site's HTML:

links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]'))
Running it, I have the following result:
[[1]]
{xml_nodeset (8)}
[1] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-d ...
[2] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-c ...
[3] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/">Dona ...
[4] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/">Av ...
[5] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-con ...
[6] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-confira-o-guia-da-partida-avai-x- ...
[7] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-finaliza-preparacao-para-enfrentar-a-chapecoe ...
[8] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-arbitragem-para-avai-x-chapecoens ..
With that in mind, how can I improve my code to select only the link from the class?
I already tried the code below, but it did not work:

links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]') %>%
        html_element("href"))

The result was:
{xml_nodeset (8)}
[1] <NA>
[2] <NA>
[3] <NA>
[4] <NA>
[5] <NA>
[6] <NA>
[7] <NA>
[8] <NA>
To get the links, use html_attr(); the links are attached to the <a> node/element as the href attribute.
url <- "https://www.avai.com.br/novo/"

url %>%
  read_html() %>%
  html_nodes('.gdlr-blog-title') %>%
  html_nodes('a') %>%
  html_attr('href')
[1] "https://www.avai.com.br/novo/se-e-baya-e-bom-atacante-paulo-baya-e-apresentado-no-leao/"
[2] "https://www.avai.com.br/novo/comunicado-arquivada-denuncia-no-stjd/"
[3] "https://www.avai.com.br/novo/sob-chuva-leao-se-reapresenta-nesta-tarde-de-quinta-feira/"
[4] "https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-de-atletas/"
[5] "https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-catarinense-fort-2022/"
[6] "https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/"
[7] "https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/"
[8] "https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-conselho-deliberativo-11/"

Scraping a table from OECD

I'm trying to scrape a table from https://data.oecd.org/unemp/unemployment-rate.htm, specifically the table at https://data.oecd.org/chart/66NJ. I want to scrape the months at the top and all the values in the rows 'OECD - Total' and 'The Netherlands'.
After trying many different pieces of code and searching on this and other forums, I just can't figure out how to scrape this table. I have tried many different HTML nodes found via SelectorGadget or by inspecting an element in my browser, but I keep getting 'list of 0' or 'character empty'.
Any help would be appreciated.
library(tidyverse)
library(rvest)
library(XML)
library(magrittr)
# Get element data from one page
url <- "https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07"

# scrape all elements
content <- read_html(url)

# trying to load in a table (gives list of 0)
inladentable <- readHTMLTable(url)

# gather all months (gives character 'empty')
months <- content %>%
  html_nodes(".table-chart-sort-link") %>%
  html_table()

# gather all values for the row 'OECD - Total'
wwpercentage <- content %>%
  html_nodes(".table-chart-has-status-e") %>%
  html_text()

# Combine into a tibble
wwtable <- tibble(months = months, wwpercentage = wwpercentage)
This is JSON and not HTML.
You can query it using httr and jsonlite:
library(httr)
res <- GET("https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07")
res <- jsonlite::fromJSON(content(res,as='text'))
res
#> $header
#> $header$id
#> [1] "98b762f3-47aa-4e28-978a-a4a6f6b3995a"
#>
#> $header$test
#> [1] FALSE
#>
#> $header$prepared
#> [1] "2020-09-30T21:58:10.5763805Z"
#>
#> $header$sender
#> $header$sender$id
#> [1] "OECD"
#>
#> $header$sender$name
#> [1] "Organisation for Economic Co-operation and Development"
#>
#>
#> $header$links
#> href
#> 1 https://stats.oecd.org:443/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07
#> rel
#> 1 request
#>
#>
#> $dataSets
#> action observations.0:0:0:0:0:0 observations.0:0:0:0:0:1
#> 1 Information 5.600849, 0.000000, NA 5.645914, 0.000000, NA
...
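From here, pulling the actual numbers out is a matter of walking the SDMX-JSON structure. A rough, untested sketch; the observations.* column handling and the structure$dimensions$observation path are assumptions based on the printed response and the usual SDMX-JSON layout:

# each "observations.<keys>" list-column holds a vector whose first
# element is the observation value (the unemployment rate)
obs_cols <- grep("^observations", names(res$dataSets), value = TRUE)
values <- vapply(res$dataSets[1, obs_cols], function(x) x[[1]][1], numeric(1))

# the labels for each key position (country, time period, ...) should be
# listed under the structure section, in the same order as the keys
str(res$structure$dimensions$observation)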

R Webscraping RCurl and httr Content

I'm learning a bit about web scraping and I have a question regarding two packages (httr and RCurl). I'm trying to get a journal's code (ISSN) from the ResearchGate website, and I came across a situation: when extracting the content from the site, I get the ISSN with the RCurl package, but with httr my function returns NULL. Could anyone tell me why? In my opinion both approaches should work. The code follows below.
library(rvest)
library(httr)
library(RCurl)
library(stringr)

url <- "https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics"

########
# httr #
########
conexao <- GET(url)
conexao_status <- http_status(conexao)
conexao_status

content(conexao, as = "text", encoding = "utf-8") %>% read_html() -> webpage1

ISSN <- webpage1 %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text() %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist()
ISSN

#########
# RCurl #
#########
options(RCurlOptions = list(verbose = FALSE,
                            capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
                            ssl.verifypeer = FALSE))

webpage <- getURLContent(url) %>% read_html()

ISSN <- webpage %>%
  html_nodes(xpath = '//*/div/div[2]/div[1]/div[1]/table[2]/tbody/tr[7]/td') %>%
  html_text() %>%
  str_to_title() %>%
  str_split(" ") %>%
  unlist()
ISSN
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testit_0.7      dplyr_0.7.4     progress_1.1.2  readxl_1.1.0
[5] stringr_1.3.0   RCurl_1.95-4.10 bitops_1.0-6    httr_1.3.1
[9] rvest_0.3.2     xml2_1.2.0      jsonlite_1.5

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16      bindr_0.1.1       magrittr_1.5      R6_2.2.2
[5] rlang_0.2.0       tools_3.5.0       yaml_2.1.19       assertthat_0.2.0
[9] tibble_1.4.2      bindrcpp_0.2.2    curl_3.2          glue_1.2.0
[13] stringi_1.1.7     pillar_1.2.2      compiler_3.5.0    cellranger_1.1.0
[17] prettyunits_1.0.2 pkgconfig_2.0.1
Because the content type is JSON and not HTML, you can't use read_html() on it:
> conexao
Response [https://www.researchgate.net/journal/0730-0301_Acm_Transactions_On_Graphics]
Date: 2018-06-02 03:15
Status: 200
Content-Type: application/json; charset=utf-8
Size: 328 kB
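A quick way to check this programmatically before choosing a parser is httr's http_type(), which returns the media type from that header:

http_type(conexao)
## [1] "application/json"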
Use fromJSON() instead to extract the ISSN:
library(jsonlite)
result <- fromJSON(content(conexao, as = "text", encoding = "utf-8") )
result$result$data$journalFullInfo$data$issn
result:
> result$result$data$journalFullInfo$data$issn
[1] "0730-0301"

Web scraping HTML in R

I want to get the URL list from scraping http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm, like this:
[1] "P-Obama-Inaugural-Speech-Inauguration.htm"
[2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
and this is my code:
library(XML)
url = "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc = htmlTreeParse(url, useInternalNodes = T)
url.list = xpathSApply(doc, "//a[contains(@href, 'htm')]")
The problem is that I want to unlist() url.list so I can strsplit() it, but it doesn't unlist.
One more step ought to do it (just need to get the href attribute):
library(XML)
url <- "http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)
url.list <- xpathSApply(doc, "//a[contains(@href, 'htm')]")
hrefs <- gsub("^/", "", sapply(url.list, xmlGetAttr, "href"))
head(hrefs, 6)
## [1] "P-Obama-Inaugural-Speech-Inauguration.htm"
## [2] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [3] "E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"
## [4] "E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"
## [5] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [6] "E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
free(doc)
UPDATE: Obligatory rvest + dplyr way:
library(rvest)
library(dplyr)
speeches <- read_html("http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm")
speeches %>% html_nodes("a[href*=htm]") %>% html_attr("href") %>% head(6)
## same output as above
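The sample output above shows each speech link appearing twice; if that matters, a unique() step on top of the same pipe dedupes it:

speeches %>%
  html_nodes("a[href*=htm]") %>%
  html_attr("href") %>%
  unique() %>%
  head(6)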