html_attr doesn't return the 'href' attr - html

First of all, I'm a real beginner with web scraping.
I'm working on this website. I'm trying to get the links to the webpages with discussion about the episode. With SelectorGadget I managed to select only the part of the HTML containing the frame with the topics:
library(rvest)
# html() is the old rvest name; current versions use read_html()
html.s1e01 <- html("http://asoiaf.westeros.org/index.php/forum/41-e01-winter-is-coming/")
html.s1e01.page <- html_nodes(html.s1e01, ".ipsBox")
Now I want to get all the links to the topics, so I tried:
html_attr(html.s1e01.page, "href")
but I get NA. I've seen similar examples on the Internet and they suggest it should work. Any idea why it doesn't?

html.s1e01.page <- html_nodes(html.s1e01, ".ipsBox .topic_title")
html.s1e01.topics <- html.s1e01.page %>% html_attr("href")
html.s1e01.topics
## [1] "http://asoiaf.westeros.org/index.php/topic/49408-poll-how-would-you-rate-episode-101/"
## [2] "http://asoiaf.westeros.org/index.php/topic/109202-death-of-john-aryn-season-4-episode-5-spoilers/"
## [3] "http://asoiaf.westeros.org/index.php/topic/49310-book-spoilers-episode-101-take-3/"
## [4] "http://asoiaf.westeros.org/index.php/topic/90902-sir-john-standingjonarryn/"
## [5] "http://asoiaf.westeros.org/index.php/topic/106105-did-anyone-notice-the-color-of-the-feather-in-lyannas-tomb/"
## [6] "http://asoiaf.westeros.org/index.php/topic/49116-book-tv-spoilers-what-was-left-out-and-what-was-left-in/"
## [7] "http://asoiaf.westeros.org/index.php/topic/49070-no-spoilers-ep101-discussion/"
## [8] "http://asoiaf.westeros.org/index.php/topic/49159-book-spoilers-the-book-was-better/"
## [9] "http://asoiaf.westeros.org/index.php/topic/57614-runes-in-agot-spoilers-i-suppose/"
## [10] "http://asoiaf.westeros.org/index.php/topic/49151-book-spoilers-ep101-discussion-mark-ii/"
## [11] "http://asoiaf.westeros.org/index.php/topic/49161-booktv-spoilers-dany-drogo/"
## [12] "http://asoiaf.westeros.org/index.php/topic/49071-book-spoilers-ep101-discussion/"
## [13] "http://asoiaf.westeros.org/index.php/topic/49100-no-spoilers-pre-airing-discussion/"

Related

Recognizing and Keeping Elements Containing Certain Patterns in a List

I want to try to web-scrape my own Stack Overflow profiles! By this I mean, get an HTML link for every question I have ever asked:
https://stackoverflow.com/users/18181916/antonoyaro8
https://math.stackexchange.com/users/1024449/antonoyaro8
I tried to do this as follows:
library(rvest)
library(httr)
library(XML)
url<-"https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest"
page <-read_html(url)
resource <- GET(url)
parse <- htmlParse(resource)
links <- list(xpathSApply(parse, path="//a", xmlGetAttr, "href"))
I tried to pick up on a pattern and noticed that all links to questions contain a number, so I tried to write code that checks whether elements in the list contain a number and keeps those links:
rv <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
final <- unique(grep(paste(rv, collapse = "|"), links, value = TRUE))
But I don't think I am doing this correctly: apart from the messy formatting, the final object contains links that do not have any numbers in them at all.
Can someone please show me how to web-scrape these links properly, and then repeat this for all pages (e.g. https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=2, https://stackoverflow.com/users/18181916/antonoyaro8?tab=questions&sort=newest&page=3)?
Worst comes to worst, if I can do it for one of these pages, I can manually copy/paste the code for each page and proceed that way.
Thank you!
The output is a list of length 1. We need to extract the element (with [[) before applying grep:
unique(grep(paste(rv, collapse = "|"), links[[1]], value = TRUE))
Note that rv includes the digits 0 to 9, so it matches a digit anywhere in the link. If the intention is to keep only links where digits follow questions/, match that pattern instead:
grep("questions/\\d+", links[[1]], value = TRUE)
Output:
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
...
If there are multiple pages, append the page= query parameter with paste0 or sprintf:
urls <- c(url, sprintf("%s&page=%d", url, 2:3))
out_lst <- lapply(urls, function(url) {
  resource <- GET(url)
  parse <- htmlParse(resource)
  links <- xpathSApply(parse, path = "//a", xmlGetAttr, "href")
  grep("questions/\\d+", links, value = TRUE)
})
Output:
> out_lst
[[1]]
[1] "/questions/72859976/recognizing-and-keeping-elements-containing-certain-patterns-in-a-list"
[2] "/questions/72843570/combing-two-selections-together"
[3] "/questions/72840913/selecting-rows-from-a-table-based-on-a-list"
[4] "/questions/72840624/even-out-table-in-r"
[5] "/questions/72840548/creating-a-dictionary-reference-table"
[6] "/questions/72837147/sequentially-replacing-factor-variables-with-numerical-values"
[7] "/questions/72822951/scanning-and-replacing-values-of-rows-in-r"
[8] "/questions/72822781/alternative-to-do-callrbind-data-frame-for-combining-a-list-of-data-frames"
[9] "/questions/72738885/referencing-a-query-in-another-query"
[10] "/questions/72725108/defining-cte-common-table-expressions-in-r"
[11] "/questions/72723768/creating-an-id-variable-on-the-spot"
[12] "/questions/72720013/selecting-data-using-conditions-stored-in-a-variable"
[13] "/questions/72717135/effecient-ways-to-append-sql-results-in-r"
[14] "/questions/72710448/removing-files-from-global-environment-with-a-certain-pattern"
[15] "/questions/72710203/r-sql-is-the-default-option-sampling-with-replacement"
[16] "/questions/72695401/allocating-max-memory-in-r"
[17] "/questions/72681898/randomly-delete-columns-from-datasets"
[18] "/questions/72663516/are-rds-files-more-efficient-than-csv-files"
[19] "/questions/72625690/importing-files-using-list-files"
[20] "/questions/72623856/second-most-common-element-in-each-row"
[21] "/questions/72623744/counting-the-position-where-a-pattern-is-completed"
[22] "/questions/72620501/bulk-import-export-files-from-r"
[23] "/questions/72613413/counting-every-position-where-a-pattern-appears"
[24] "/questions/72612577/counting-the-position-of-the-first-0-in-each-row"
[25] "/questions/72607160/taking-averages-across-lists"
[26] "/questions/72589276/functions-for-finding-out-the-midpoint-interpolation"
[27] "/questions/72587298/sandwiching-values-between-rows"
[28] "/questions/72569338/integration-error-lengthlower-1-is-not-true"
[29] "/questions/72568817/synchronizing-nas-in-r"
[30] "/questions/72568661/finding-the-loser-in-each-row"
[[2]]
[1] "/questions/72566170/making-a-race-between-two-variables"
[2] "/questions/72418723/making-a-list-of-random-numbers"
[3] "/questions/72418364/random-uniform-numbers-without-runif"
[4] "/questions/72353102/integrate-normal-distribution-between-2-values"
[5] "/questions/72174868/placing-commas-between-names"
[6] "/questions/72163297/simulate-flipping-french-fries-in-r"
[7] "/questions/71982286/alternatives-to-the-partition-by-statement-in-sql"
[8] "/questions/71970960/converting-lists-into-data-frames"
[9] "/questions/71970672/random-numbers-are-too-similar-to-each-other"
[10] "/questions/71933753/making-combinations-of-items"
[11] "/questions/71874791/sorting-rows-in-specified-order"
[12] "/questions/71866097/hiding-the-legend-in-this-graph"
[13] "/questions/71866048/understanding-the-median-in-this-graph"
[14] "/questions/71852517/nas-produced-when-number-of-iterations-increase"
[15] "/questions/71791906/assigning-unique-colors-to-multiple-lines-on-a-graph"
[16] "/questions/71787336/finding-identical-rows-in-multiple-datasets"
[17] "/questions/71758983/multiple-replace-lookups"
[18] "/questions/71758648/create-ascending-id-in-a-data-frame"
[19] "/questions/71731208/webscraping-data-which-pokemon-can-learn-which-attacks"
[20] "/questions/71728273/webscraping-pokemon-data"
[21] "/questions/71683045/identifying-smallest-element-in-each-row-of-a-matrix"
[22] "/questions/71671488/connecting-all-nodes-together-on-a-graph"
[23] "/questions/71641774/overriding-colors-in-ggplot2"
[24] "/questions/71641404/applying-a-function-to-a-data-frame-lapply-vs-traditional-way"
[25] "/questions/71624111/sending-emails-from-r"
[26] "/questions/71623019/sql-joining-tables-from-2-different-servers-r-vs-sas"
[27] "/questions/71429265/overriding-sql-errors-during-r-uploads"
[28] "/questions/71429129/splitting-a-dataset-into-uneven-portions"
[29] "/questions/71418533/multiplying-and-adding-values-across-rows"
[30] "/questions/71417489/tricking-an-sql-server-to-accept-a-file-from-r"
[[3]]
[1] "/questions/71417218/splitting-a-dataset-into-arbitrary-sections"
[2] "/questions/71398804/plotting-vector-fields-and-gradient-fields"
[3] "/questions/71387596/animating-the-mandelbrot-set"
[4] "/questions/71358405/repeat-a-set-of-ids-for-every-n-rows"
[5] "/questions/71344822/time-series-graphs-with-different-symbols"
[6] "/questions/71341865/creating-a-data-frame-with-commas"
[7] "/questions/71287944/converting-igraph-to-visnetwork"
[8] "/questions/71282863/fixing-the-first-and-last-numbers-in-a-random-list"
[9] "/questions/71282403/adding-labels-to-graph-nodes"
[10] "/questions/71262761/understanding-list-and-do-call-commands"
[11] "/questions/71261431/adjusting-graph-layouts"
[12] "/questions/71255038/overriding-non-existent-components-in-a-loop"
[13] "/questions/71244872/fixing-cluttered-titles-on-graphs"
[14] "/questions/71243676/directly-adding-titles-and-labels-to-visnetwork"
[15] "/questions/71232353/removing-all-edges-in-igraph"
[16] "/questions/71230273/writing-a-function-that-references-elements-in-a-matrix"
[17] "/questions/71227260/generating-random-graphs-according-to-some-conditions"
[18] "/questions/71087349/adding-combinations-of-numbers-in-a-matrix"

How can I select only the links when I bring up the entire class in R?

Libraries:
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
When I use this code, I bring up the entire class of the HTML for the site:
links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]'))
Running it, I have the following result:
[[1]]
{xml_nodeset (8)}
[1] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-d ...
[2] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-c ...
[3] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/">Dona ...
[4] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/">Av ...
[5] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-con ...
[6] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-confira-o-guia-da-partida-avai-x- ...
[7] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/avai-finaliza-preparacao-para-enfrentar-a-chapecoe ...
[8] <h3 class="gdlr-blog-title"><a href="https://www.avai.com.br/novo/catarinense-2022-arbitragem-para-avai-x-chapecoens ..
With that in mind, how can I improve my code to select only the link from the class?
I already tried the code below, but it did not work:
links_avai <- paste0("https://avai.com.br/page", seq(from = 1, to = 2)) %>%
  map(. %>%
        read_html() %>%
        html_nodes(xpath = '//*[@class="gdlr-blog-title"]') %>%
        html_element("href"))
the result was:
{xml_nodeset (8)}
[1] <NA>
[2] <NA>
[3] <NA>
[4] <NA>
[5] <NA>
[6] <NA>
[7] <NA>
[8] <NA>
To get the links, use html_attr(); the links are attached to the <a> node/element:
url<-"https://www.avai.com.br/novo/"
url %>%
read_html() %>%
html_nodes('.gdlr-blog-title') %>% html_nodes('a') %>%
html_attr('href')
[1] "https://www.avai.com.br/novo/se-e-baya-e-bom-atacante-paulo-baya-e-apresentado-no-leao/"
[2] "https://www.avai.com.br/novo/comunicado-arquivada-denuncia-no-stjd/"
[3] "https://www.avai.com.br/novo/sob-chuva-leao-se-reapresenta-nesta-tarde-de-quinta-feira/"
[4] "https://www.avai.com.br/novo/entenda-como-funciona-o-processo-de-apresentacao-de-atletas/"
[5] "https://www.avai.com.br/novo/ingressos-a-venda-para-avai-x-barra-3a-rodada-do-catarinense-fort-2022/"
[6] "https://www.avai.com.br/novo/dona-nesi-furlani-recebe-homenagem-do-avai/"
[7] "https://www.avai.com.br/novo/avai-e-superado-pela-chapecoense-na-ressacada/"
[8] "https://www.avai.com.br/novo/edital-de-convocacao-reuniao-extraordinaria-do-conselho-deliberativo-11/"

Get nodes from a html webpage to crawl URLs using R

(Screenshot from the question: https://i.stack.imgur.com/xeczg.png)
I am trying to get the URLs under the node '.2lines' from the webpage 'https://www.sgcarmart.com/main/index.php'
library(rvest)
url <- read_html('https://www.sgcarmart.com/main/index.php') %>% html_nodes('.2lines') %>% html_attr()
This gives an error from the html_nodes function:
Error in parse_simple_selector(stream) :
Expected selector, got <NUMBER '.2' at 1>
How do I get around this error?
You can use an XPath selector to find the nodes you want; the CSS parser chokes because a class selector cannot start with a digit (.2lines is not valid CSS), while XPath has no such restriction. The links are actually contained in <a> tags within the <p> tags you are trying to reference by class. You can access them with a single XPath expression:
library(rvest)
site <- 'https://www.sgcarmart.com'
urls <- site %>%
paste0("/main/index.php") %>%
read_html() %>%
html_nodes(xpath = "//*[#class = '2lines']/a") %>%
html_attr("href") %>%
{paste0(site, .)}
urls
#> [1] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12485"
#> [2] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11875"
#> [3] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11531"
#> [4] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=11579"
#> [5] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12635"
#> [6] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12507"
#> [7] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12644"
#> [8] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12622"
#> [9] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12650"
#> [10] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12651"
#> [11] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12589"
#> [12] "https://www.sgcarmart.com/new_cars/newcars_overview.php?CarCode=12649"
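As an aside, a CSS attribute selector sidesteps the leading-digit restriction as well; a hedged alternative sketch (untested against the live page, and note that [class='2lines'] matches only when the class attribute is exactly "2lines"):
library(rvest)
read_html("https://www.sgcarmart.com/main/index.php") %>%
  html_nodes("p[class='2lines'] a") %>%
  html_attr("href")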

How To Extract Name in this HTML Element using rvest

I've searched through many rvest scraping posts but can't find an example like mine. I'm following the rvest vignette example (https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/) for SelectorGadget, but substituting my use case as necessary. None of SelectorGadget's suggestions get me what I need. I need to extract the name for each review on the page. A sample of what a name looks like under the hood is as follows:
<span itemprop="name" class="sg_selected">This Name</span>
Here's my code to this point. Ideally, this code should get me the individual names on this web page.
library(rvest)
library(dplyr)
dsa_reviews <-
  read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
review_names <- html_nodes(dsa_reviews, '#reviews span')
df <- bind_rows(lapply(xml_attrs(review_names), function(x)
  data.frame(as.list(x), stringsAsFactors = FALSE)))
Apologies if this is a duplicate question or if it's not formatted correctly. Please feel free to request any necessary edits.
Here it is:
library(rvest)
library(dplyr)
dsa_reviews <-
  read_html("https://www.directsalesaid.com/companies/traveling-vineyard#reviews")
html_nodes(dsa_reviews, '[itemprop=name]') %>%
html_text()
[1] "Traveling Vineyard" ""
[3] "Kiersten Ray-kuhn" "Miley Sama"
[5] " Nancy Shawtone " "Amanda Moore"
[7] "Matt" "Kathy Barzal"
[9] "Lesa Brinker" "Lori Stryker"
[11] "Jeanette Holtman" "Penny Notarnicola"
[13] "Laura Ann" "Nicole Lafave"
[15] "Gretchen Hess Miller" "Gina Devine"
[17] "Ashley Lawton Converse" "Morgan Williams"
[19] "Angela Baston Mckeone" "Traci Feshler"
[21] "Kisha Marshall Dlugos" "Jody Cole Dvorak"

In R - crawling with rvest - failing to get the text in an HTML tag using the html_text function

url <-"http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392"
hh = read_html(GET(url),encoding = "EUC-KR")
#guess_encoding(hh)
html_text(html_node(hh, 'div.par'))
#html_text(html_nodes(hh ,xpath='//*[#id="news_body_id"]/div[2]/div[3]'))
I'm trying to crawl news data (just for practice) using rvest in R.
When I tried it on the page above, I failed to fetch the text. (The XPath doesn't work either.)
I don't think I failed to find the node that contains the text I want. But when I try to extract the text from it using the html_text function, I get "" or blanks, and I can't figure out why. I don't have any experience with HTML or crawling.
My guess is that it has to do with the tag containing the news body having "class" and "data-dzo" attributes (I don't know what the latter is).
Could anyone tell me how to solve this, or suggest search keywords I could use on Google to find a solution?
The site builds quite a bit of the page dynamically, which is why html_text() comes back empty. This should help.
The article content is in an XML file whose URL can be constructed from the contid parameter. Pass either a full article HTML URL (like the one in your example) or just the contid value to the function below, and it'll return an xml2 xml_document with the parsed XML results:
#' Retrieve article XML from chosun.com
#'
#' @param full_url_or_article_id either a full URL like
#'   `http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392`
#'   or just the id (e.g. `1999080570392`)
#' @return xml_document
read_chosun_article <- function(full_url_or_article_id) {

  require(rvest)
  require(httr)

  full_url_or_article_id <- full_url_or_article_id[1]

  if (grepl("^http", full_url_or_article_id)) {
    contid <- httr::parse_url(full_url_or_article_id)
    contid <- contid$query$contid
  } else {
    contid <- full_url_or_article_id
  }

  # The target article XML URLs are in the following format:
  #
  #   http://news.chosun.com/priv/data/www/news/1999/08/05/1999080570392.xml
  #
  # so we need to construct it from substrings in the 'contid'
  sprintf(
    "http://news.chosun.com/priv/data/www/news/%s/%s/%s/%s.xml",
    substr(contid, 1, 4), # year
    substr(contid, 5, 6), # month
    substr(contid, 7, 8), # day
    contid
  ) -> contid_xml_url

  res <- httr::GET(contid_xml_url)

  httr::content(res)

}
read_chosun_article("http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392")
## {xml_document}
## <content>
## [1] <id>1999080570392</id>
## [2] <site>\n <id>1</id>\n <name><![CDATA[www]]></name>\n</site>
## [3] <category>\n <id>3N1</id>\n <name><![CDATA[사람들]]></name>\n <path ...
## [4] <type>0</type>
## [5] <template>\n <id>2006120400003</id>\n <fileName>3N.tpl</fileName> ...
## [6] <date>\n <created>19990805192041</created>\n <createdFormated>199 ...
## [7] <editor>\n <id>chosun</id>\n <email><![CDATA[webmaster@chosun.com ...
## [8] <source><![CDATA[0]]></source>
## [9] <title><![CDATA[[동정] 이철승, 순국학생 위령제 지내 등]]></title>
## [10] <subTitle/>
## [11] <indexTitleList/>
## [12] <authorList/>
## [13] <masterId>1999080570392</masterId>
## [14] <keyContentId>1999080570392</keyContentId>
## [15] <imageList count="0"/>
## [16] <mediaList count="0"/>
## [17] <body count="1">\n <page no="0">\n <paragraph no="0">\n <t ...
## [18] <copyright/>
## [19] <status><![CDATA[RL]]></status>
## [20] <commentBbs>N</commentBbs>
## ...
read_chosun_article("1999080570392")
## {xml_document}
## <content>
## [1] <id>1999080570392</id>
## [2] <site>\n <id>1</id>\n <name><![CDATA[www]]></name>\n</site>
## [3] <category>\n <id>3N1</id>\n <name><![CDATA[사람들]]></name>\n <path ...
## [4] <type>0</type>
## [5] <template>\n <id>2006120400003</id>\n <fileName>3N.tpl</fileName> ...
## [6] <date>\n <created>19990805192041</created>\n <createdFormated>199 ...
## [7] <editor>\n <id>chosun</id>\n <email><![CDATA[webmaster@chosun.com ...
## [8] <source><![CDATA[0]]></source>
## [9] <title><![CDATA[[동정] 이철승, 순국학생 위령제 지내 등]]></title>
## [10] <subTitle/>
## [11] <indexTitleList/>
## [12] <authorList/>
## [13] <masterId>1999080570392</masterId>
## [14] <keyContentId>1999080570392</keyContentId>
## [15] <imageList count="0"/>
## [16] <mediaList count="0"/>
## [17] <body count="1">\n <page no="0">\n <paragraph no="0">\n <t ...
## [18] <copyright/>
## [19] <status><![CDATA[RL]]></status>
## [20] <commentBbs>N</commentBbs>
## ...
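From there, a short sketch of pulling the article text out of the returned document; the <body>/<page>/<paragraph> structure is visible in the truncated output above, so the XPath below is an assumption based on that listing:
doc <- read_chosun_article("1999080570392")
paras <- xml2::xml_find_all(doc, ".//body//paragraph")
xml2::xml_text(paras)  # xml_text() gathers all descendant text per paragraph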
NOTE: I poked around that site to see whether this violates their terms of service, and it does not seem to, but I also relied on Google Translate and that may have made such a clause harder to find. It's important to ensure you can legally (and ethically, if you care about ethics) scrape this content for whatever use you intend.