Usually when scraping websites I use SelectorGadget; otherwise, I inspect elements on the page. However, I am running into a bit of trouble when trying to scrape this one website.
The HTML looks like this:
<div class="col-span-2 mt-16 sm:mt-4 flex justify-between sm:block space-x-12 font-bold"><span>103 m²</span><span>8 650 000 kr</span></div>
Elements that I want:
<span>103 m²</span>
<span>8 650 000 kr</span>
They look like this:
103 m²
8 650 000 kr
My simple R code:
library(rvest)

# The URL (the page number is filled in with sprintf)
url <- "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"
page_outside <- read_html(sprintf(url, 1))
# "x" is a placeholder; I don't know which selector to use here
element_1 <- page_outside %>% html_nodes("x") %>% html_text()
Anyone got any tips or ideas on how I can access these?
Thanks!
Here is a possibility: parse out the span nodes under a div with class "justify-between".
url = "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"
page_outside <- read_html(sprintf(url,1))
element_1 <- page_outside %>% html_elements("div.justify-between span")
element_1
{xml_nodeset (100)}
[1] <span>47 m²</span>
[2] <span>3 250 000 kr</span>
[3] <span>102 m²</span>
[4] <span>2 400 000 kr</span>
[5] <span>100 m²</span>
[6] <span>10 000 000 kr</span>
[7] <span>122 m²</span>
[8] <span>9 950 000 kr</span>
[9] <span>90 m²</span>
[10] <span>4 790 000 kr</span>
...
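If every listing has both values, the node text can be reshaped straight into a two-column matrix. A minimal sketch (the column names are my own labels, and it assumes exactly two spans per div):
result <- matrix(html_text(element_1), ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("area", "price")))
head(result)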
Update
If there is some missing data, then a slightly longer solution is needed to track which element is missing.
divs <- page_outside %>% html_elements("div.justify-between")
answer <- lapply(divs, function(node) {
  values <- node %>% html_elements("span") %>% html_text()
  if (length(values) == 2) {
    # both the area and the price are present
    results <- t(values)
  } else if (grepl("kr", values)) {
    # only the price is present
    results <- c(NA, values)
  } else {
    # only the area is present
    results <- c(values, NA)
  }
  results
})
answer <- do.call(rbind, answer)
answer
[,1] [,2]
[1,] "87 m²" "2 790 000 kr"
[2,] "124 m²" "5 450 000 kr"
[3,] "105 m²" "4 500 000 kr"
[4,] "134 m²" "1 500 000 kr"
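Optionally, the matrix can be tidied into a named data frame for downstream work (the column names are just a suggestion):
# name the columns and convert to a data frame
colnames(answer) <- c("area", "price")
answer <- as.data.frame(answer)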
Related
Libraries:
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
When I use this code, I get the entire class of HTML for the site:
links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]'))
Running it, I get the following result:
[[1]]
{xml_nodeset (8)}
[1] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-d ...
[2] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-c ...
[3] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/">Dona ...
[4] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/">Av ...
[5] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-con ...
[6] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-confira-o-guia-da-partida-avai-x- ...
[7] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-finaliza-preparacao-para-enfrentar-a-chapecoe ...
[8] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-arbitragem-para-avai-x-chapecoens ...
With that in mind, how can I improve my code to select only the link from the class?
I already tried the code below, but it did not work:
links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]') %>%
        html_element("href"))
the result was:
{xml_nodeset (8)}
[1] <NA>
[2] <NA>
[3] <NA>
[4] <NA>
[5] <NA>
[6] <NA>
[7] <NA>
[8] <NA>
To get the links, use html_attr(); the links are attached to the a node/element.
url<-"https://www.avai.com.br/novo/"
url %>%
read_html() %>%
html_nodes('.gdlr-blog-title') %>% html_nodes('a') %>%
html_attr('href')
[1] "https://www.avai.com.br/novo/se-e-baya-e-bom-atacante-paulo-baya-e-apresentado-no-leao/"
[2] "https://www.avai.com.br/novo/comunicado-arquivada-denuncia-no-stjd/"
[3] "https://www.avai.com.br/novo/sob-chuva-leao-se-reapresenta-nesta-tarde-de-quinta-feira/"
[4] "https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-de-atletas/"
[5] "https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-catarinense-fort-2022/"
[6] "https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/"
[7] "https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/"
[8] "https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-conselho-deliberativo-11/"
I'm trying to scrape a table from https://data.oecd.org/unemp/unemployment-rate.htm, specifically the table at https://data.oecd.org/chart/66NJ. I want to scrape the months at the top and all the values in the rows 'OECD - Total' and 'The Netherlands'.
After trying many different pieces of code and searching this and other forums, I just can't figure out how to scrape this table. I have tried many different selectors found via SelectorGadget or by inspecting an element in my browser, but I keep getting 'list of 0' or 'character (empty)'.
Any help would be appreciated.
library(tidyverse)
library(rvest)
library(XML)
library(magrittr)
# Get element data from one page
url <- "https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07"
# scrape all elements
content <- read_html(url)
# trying to load in a table (gives a list of 0)
inladentable <- readHTMLTable(url)
# gather all months (gives character (empty))
months <- content %>%
  html_nodes(".table-chart-sort-link") %>%
  html_table()
# gather all values for the row 'OECD - Total'
wwpercentage <- content %>%
  html_nodes(".table-chart-has-status-e") %>%
  html_text()
# Combine into a tibble
wwtable <- tibble(months = months, wwpercentage = wwpercentage)
This is JSON and not HTML.
You can query it using httr and jsonlite:
library(httr)
res <- GET("https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07")
res <- jsonlite::fromJSON(content(res,as='text'))
res
#> $header
#> $header$id
#> [1] "98b762f3-47aa-4e28-978a-a4a6f6b3995a"
#>
#> $header$test
#> [1] FALSE
#>
#> $header$prepared
#> [1] "2020-09-30T21:58:10.5763805Z"
#>
#> $header$sender
#> $header$sender$id
#> [1] "OECD"
#>
#> $header$sender$name
#> [1] "Organisation for Economic Co-operation and Development"
#>
#>
#> $header$links
#> href
#> 1 https://stats.oecd.org:443/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07
#> rel
#> 1 request
#>
#>
#> $dataSets
#> action observations.0:0:0:0:0:0 observations.0:0:0:0:0:1
#> 1 Information 5.600849, 0.000000, NA 5.645914, 0.000000, NA
...
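From here the numeric values can be pulled out of the observation entries. A rough sketch, assuming the standard SDMX-JSON layout shown above (inspect str(res$dataSets) and res$structure if the names differ):
obs <- res$dataSets$observations                              # one column per series key
values <- vapply(obs, function(x) unlist(x)[1], numeric(1))   # first entry is the rate
head(values)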
[Screenshot: https://i.stack.imgur.com/xeczg.png]
I am trying to get the URLs under the node '.2lines' from the webpage 'https://www.sgcarmart.com/main/index.php'
library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()
I receive an error from the html_nodes function:
Error in parse_simple_selector(stream) :
Expected selector, got <NUMBER '.2' at 1>
How do I get around this error?
You can use an xpath selector to find the nodes you want. The links are actually contained in <a> tags within the <p> tags you are trying to reference by class. You can access them in a single xpath:
library(rvest)
site <- 'https://www.sgcarmart.com'
urls <- site %>%
paste0("/main/index.php") %>%
read_html() %>%
html_nodes(xpath = "//*[@class = '2lines']/a") %>%
html_attr("href") %>%
{paste0(site, .)}
urls
#> [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#> [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#> [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#> [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#> [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#> [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#> [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#> [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#> [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
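For what it's worth, the CSS error happens because a class selector cannot start with a digit. If you would rather stay with CSS, an attribute selector should sidestep the restriction (a sketch of the same pipeline; untested):
urls <- site %>%
  paste0("/main/index.php") %>%
  read_html() %>%
  html_nodes("[class~='2lines'] a") %>%
  html_attr("href") %>%
  {paste0(site, .)}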
I've searched through many rvest scraping posts but can't find an example like mine. I'm following the rvest vignette example (https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/) for SelectorGadget, inputting my use case as necessary. None of SelectorGadget's suggestions gets me what I need. I need to extract the name for each review on the page. A sample of what a name looks like under the hood is as follows:
<span itemprop="name" class="sg_selected">This Name</span>
Here's my code to this point. Ideally, this code should get me the individual names on this web page.
library(rvest)
library(dplyr)
dsa_reviews <-
  read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
review_names <- html_nodes(dsa_reviews, '#reviews span')
df <- bind_rows(lapply(xml_attrs(review_names), function(x)
  data.frame(as.list(x), stringsAsFactors = FALSE)))
Apologies if this is a duplicate question or if it's not formatted correctly. Please feel free to request any necessary edits.
Here it is:
library(rvest)
library(dplyr)
dsa_reviews <-
read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
html_nodes(dsa_reviews,'[itemprop=name]') %>%
html_text()
[1] "Traveling Vineyard" ""
[3] "Kiersten Ray-kuhn" "Miley Sama"
[5] " Nancy Shawtone " "Amanda Moore"
[7] "Matt" "Kathy Barzal"
[9] "Lesa Brinker" "Lori Stryker"
[11] "Jeanette Holtman" "Penny Notarnicola"
[13] "Laura Ann" "Nicole Lafave"
[15] "Gretchen Hess Miller" "Gina Devine"
[17] "Ashley Lawton Converse" "Morgan Williams"
[19] "Angela Baston Mckeone" "Traci Feshler"
[21] "Kisha Marshall Dlugos" "Jody Cole Dvorak"
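If only the reviewer names are wanted (the company name and the empty string come from spans elsewhere on the page), scoping the selector to the reviews block should drop them. A sketch, assuming the #reviews container from the question's own selector:
html_nodes(dsa_reviews, '#reviews [itemprop=name]') %>%
  html_text() %>%
  trimws()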
I'm trying my hand at scraping tables from Wikipedia and I'm reaching an impasse. I'm using the squads of the FIFA 2014 World Cup as an example. In this case, I want to extract the list of the participating countries from the table of the contents from the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:
library(tidyverse)
library(rvest)
library(XML)
library(RCurl)
(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>%
html_node(xpath = '//*[#id="toc"]/ul') %>%
htmlTreeParse() %>%
xmlRoot())
This spits out a bunch of HTML code that I won't copy/paste here. I specifically am looking to extract all lines with the tag <span class="toctext"> such as "Group A", "Brazil", "Cameroon", etc. and have them saved as a vector. What function would make this happen?
You can read the text from a node using html_text()
url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
read_html() %>%
html_node(xpath = '//*[#id="toc"]') %>%
html_text()
This gives you a single character vector. You can then split on the \n character to give you the results as a vector (and you can clean out the blanks)
contents <- strsplit(toc, "\n")[[1]]
contents[contents != ""]
# [1] "Contents" "1 Group A" "1.1 Brazil"
# [4] "1.2 Cameroon" "1.3 Croatia" "1.4 Mexico"
# [7] "2 Group B" "2.1 Australia" "2.2 Chile"
# [10] "2.3 Netherlands" "2.4 Spain" "3 Group C"
# [13] "3.1 Colombia" "3.2 Greece" "3.3 Ivory Coast"
# [16] "3.4 Japan" "4 Group D" "4.1 Costa Rica"
# [19] "4.2 England" "4.3 Italy" "4.4 Uruguay"
# ---
# etc
Generally, to read tables in an HTML document you can use the html_table() function, but in this case the table of contents isn't read:
url %>%
read_html() %>%
html_table()
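Alternatively, you can target the <span class="toctext"> elements mentioned in the question directly, which also drops the section numbers; a small sketch:
url %>%
  read_html() %>%
  html_nodes("#toc .toctext") %>%
  html_text()
# e.g. "Group A", "Brazil", "Cameroon", ... without the "1", "1.1" numbering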