Missing tables when using read_html and html_nodes in rvest

Apologies for what feels like a dumb question. I'm trying to scrape some of the tables from this recently redesigned page, but only the first table on the page shows up when I run my code:
library(rvest)
url <- 'http://www.basketball-reference.com/players/b/bazemke01.html'
webpage <- read_html(url)
tables <- html_nodes(webpage, 'table')
per_game_table <- html_table(tables)[[1]]
I'm not sure how to find the remaining tables on the page if they aren't showing up in the html_nodes() result. They look like ordinary tables to me. I've tried html_nodes() with their XPaths and CSS selectors, but they just don't seem to be there.
Trying to get the next table gives me a subscript out of bounds error:
next_table <- html_table(tables)[[2]]
Error in html_table(tables)[[2]] : subscript out of bounds
Any help would be greatly appreciated!
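One possibility worth checking (an assumption on my part, not something stated in the question): Sports Reference pages are known to ship the later tables inside HTML comments and inject them with JavaScript, so read_html() only sees the first table in the live markup. A minimal sketch, assuming that is the case here, which pulls the comment text out, re-parses it, and reads the tables from that:
# Hedged sketch: extract tables that may be hidden inside HTML comments
library(rvest)
url <- 'http://www.basketball-reference.com/players/b/bazemke01.html'
webpage <- read_html(url)
comment_text <- webpage %>%
  html_nodes(xpath = '//comment()') %>%  # every comment node in the page
  html_text() %>%
  paste(collapse = '')
hidden_tables <- comment_text %>%
  read_html() %>%                        # re-parse the commented-out markup
  html_nodes('table') %>%
  html_table(fill = TRUE)
If the tables really are commented out, hidden_tables should contain the ones missing from the original tables object.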

Related

returning character(0) when scraping with rvest

I'm trying to do some web scraping with rvest. I'm new to R, so I have a bit of a knowledge barrier. I want to scrape the following URL:
https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85
That directs to a website in Arabic, but I don't think you need to be able to read Arabic to advise me. Basically, this is the first results page for a specific search term on this website (which is not a search engine). What I want to do is use rvest to scrape this page and return a list of the titles of the hyperlinks returned by the search. Using SelectorGadget, I identified that the node containing those titles is called ".h2NewsTitle". However, when I try to scrape that node using the code below, all I get in return is "character(0)":
library(tidyverse)
library(rvest)
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
html_nodes(".h2NewsTitle") %>%
html_text()
I don't think the issue here has to do with the Arabic text itself. I'm pretty sure everything is in UTF-8, and I can scrape other nodes on the same page and return Arabic text without issue. For example, the code below returns the Arabic text "بحث أسبوعي", which corresponds to the Arabic text in that node on the page itself:
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
html_nodes("WeeklySearch") %>%
html_text()
So I'm unsure why, when I try to scrape the ".h2NewsTitle" node, I just get character(0) in return. I wonder if it has to do with some elements being rendered with JavaScript. This is a bit outside my expertise, so any advice on how to proceed would be appreciated. I'd like to continue using R, but I'm open to switching to Python/Beautiful Soup or something similar if it's better suited for this.
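A quick diagnostic, offered as a sketch rather than a definitive answer: fetch the page once and check whether the class name appears anywhere in the HTML the server actually returns. If it does not, the titles are most likely injected client-side by JavaScript, and rvest on its own will never see them; you would need the underlying JSON/search endpoint or a headless browser instead.
# Diagnostic sketch: is "h2NewsTitle" present in the server-delivered HTML at all?
library(rvest)
page <- read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85")
grepl("h2NewsTitle", as.character(page), fixed = TRUE)
# FALSE would suggest the titles are rendered by JavaScript after page load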

Scraping Github commit author element

Any html whizzes out there able to extract the text for an element on this link: https://github.com/tidyverse/ggplot2
The element text required is the user name of the latest commit's author.
I'm currently using rvest in R. I've tried XPath, CSS selectors, etc., but I'm just unable to extract the user name. I'm quite happy to take a link containing the name and clean up the text with regex if needed.
Any help greatly appreciated.
library(rvest)
read_html("https://github.com/tidyverse/ggplot2") %>%
html_nodes(".user-mention") %>%
html_text()
# [1] "thomasp85"
But if you are trying to grab information from multiple repos, you may want to consider using the official GitHub REST API and/or this lightweight R package client.
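For instance, a minimal sketch of calling the commits endpoint of the REST API with httr and jsonlite (the package choice is mine, and unauthenticated requests are rate-limited):
# Hedged sketch: ask the GitHub REST API for the most recent commit
# and read the author details from the JSON response.
library(httr)
library(jsonlite)
resp <- GET("https://api.github.com/repos/tidyverse/ggplot2/commits",
            query = list(per_page = 1))
commits <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
commits$author$login        # GitHub login of the latest commit's author
commits$commit$author$name  # name recorded in the commit itself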

{xml_nodeset (0)} problem when attempting to reproduce webscraping example (don't think it's a JS issue)

I'm attempting to learn webscraping using rvest and am trying to reproduce the example given here:
https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/
Having installed rvest, I simply copy-pasted the code given in the article:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
html_table()
population <- population[[1]]
The only difference is that I use read_html() rather than html(), since the latter is deprecated.
Rather than the output reported in the article, this code yields the familiar:
Error in population[[1]] : subscript out of bounds
The origin of this is that running the code without the final two lines gives population a value of {xml_nodeset (0)}.
All of the previous questions regarding this suggest that it is caused by the table being dynamically rendered with JavaScript. But that is not the case here (unless Wikipedia has changed its formatting since the R-bloggers article in 2015).
Any insight would be much appreciated since I'm at a loss!
The html has changed. That xpath is no longer valid. You could do the following:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
read_html() %>%
html_node(xpath='//table') %>%
html_table()
As I have switched to html_node, which returns the first match, I no longer need the [1] index.
The longer xpath now has a div in it that your original path was missing:
//*[@id="mw-content-text"]/div/table[1]
That is the path you get if you right-click and copy the XPath of the table in the browser.
You want to avoid long xpaths like that, as they are fragile and, as seen here, can break easily when the HTML of the page is changed.
You could also use CSS and grab the table by class, for example:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
read_html() %>%
html_node(css='.wikitable') %>%
html_table()

rvest returning empty list

I am trying to import a table from a website by copying the XPath of the HTML element and scraping it with the rvest package. I have done this successfully multiple times before, but when I try it now I merely produce an empty list. In an attempt to diagnose my problem, I ran the following code (taken from https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/). However, this code is also producing an empty list for me.
Thanks in advance for the help!
library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
html_table()
Your xpath query is wrong. The table is not a direct child of the node with an id of mw-content-text. It is a descendant though. Try
html_nodes(xpath='//*[@id="mw-content-text"]//table[1]')
Web scraping is a very fragile endeavor and can easily break when websites change their HTML.
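One subtlety worth adding (my note, not part of the original answer): in //table[1], the [1] is applied relative to each parent, so the expression can still match more than one table. If you want literally the first table anywhere under that node, parenthesise before indexing, for example:
# Hedged sketch: pick the first matching table overall, not the first
# table within each parent element.
library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(xpath = '(//*[@id="mw-content-text"]//table)[1]') %>%
  html_table(fill = TRUE)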

xpathSApply function in R only returning one value when multiple are expected

I'm looking to scrape the article links from this website: http://america.aljazeera.com/topics/topic/categories/us.html
I'm simplifying my task by ignoring pagination; I'm only interested in the first 10 articles listed, and I currently have the following syntax:
library(RCurl)
library(XML)
response <- getURL('http://america.aljazeera.com/topics/topic/categories/us.html')
html <- htmlParse(response)
xpath <- "//div[@class='story-holder']//a"
xpathSApply(html, xpath, xmlGetAttr, 'href')
I would have expected to get all of the article links, the links in the images, and the links for the tags on each article (these will be parsed later). However, I'm only getting the first link that is embedded in the thumbnail of the first article. Any idea why it's not returning more results?
Thanks!
That page has invalid HTML markup, which is confusing the XML parser. Specifically, it has some self-closing divs which seem to be throwing everything off. You can try a more specific xpath expression that avoids the "bad" parts. If you just want the article links, maybe: xpath <- "//div[@class='media-body']//h3/a"
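Put together with the original code, that suggestion might look like the sketch below (the 'media-body' class is only a guess at the relevant container and worth confirming in the page source):
# Hedged sketch: same parsing as before, but with the narrower xpath,
# which should sidestep the malformed div markup.
library(RCurl)
library(XML)
response <- getURL('http://america.aljazeera.com/topics/topic/categories/us.html')
html <- htmlParse(response)
xpathSApply(html, "//div[@class='media-body']//h3/a", xmlGetAttr, 'href')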