Accessing HTML tables with rvest

So I want to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
The problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the page; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table via something along the lines of
table = html_nodes(webpage,'#div_four_factors')
but this doesn't seem to work, as I just get an empty list. How can I access the four factors table?

I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the source code, and the comment is only removed (presumably by JavaScript) when the page is rendered in the browser.
If we assume that the Home team is always listed second, we can just use positional arguments and scrape another table on the page:
table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution, but such is life in the world of web scraping.
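For what it's worth, you can also try to recover the commented-out table itself by pulling the comment nodes out of the page and re-parsing the relevant one. This is only a rough sketch and assumes the four factors table really does sit inside an HTML comment containing id="four_factors":
# extract every HTML comment on the page
commented <- webpage %>%
  html_nodes(xpath = "//comment()") %>%
  html_text()
# keep the comment that holds the four factors table and re-parse it
ff_comment <- commented[grepl('id="four_factors"', commented)]
if (length(ff_comment) > 0) {
  four_factors <- read_html(ff_comment[1]) %>%
    html_node("#four_factors") %>%
    html_table(fill = TRUE)
}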

Related

How to scrape a table by its class name with R?

I am trying to scrape several web pages, in particular some tables in those pages.
The problem is that the position of the table changes from page to page.
Here is my code.
url <- paste0("https://en.wikipedia.org/wiki/2011%E2%80%9312_Welsh_Premier_League")
webpage <- read_html(url)
j <- webpage %>% html_node(xpath='//*[@id="mw-content-text"]/div[1]/table') %>% html_table(fill=T)
This code works fine, but I want to scrape the other seasons too, and the position of the table changes every season.
I found that the table I want to scrape has the class "wikitable plainrowheaders", as shown below. My question is: how can I scrape all tables with the class "wikitable plainrowheaders" on a Wikipedia page?
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
Since you know the table class name, just change the corresponding xpath.
library(rvest)
url <- paste0("https://en.wikipedia.org/wiki/2011%E2%80%9312_Welsh_Premier_League")
webpage <- read_html(url)
j <- webpage %>%
  html_nodes(xpath="//table[@class='wikitable plainrowheaders']") %>%
  html_table(fill=T)
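Since CSS class selectors don't care about the order in which the classes are listed, an equivalent CSS version of the same idea should also work (a small sketch using the class names from the question):
j <- webpage %>%
  html_nodes("table.wikitable.plainrowheaders") %>%
  html_table(fill=T)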

Looping with different row numbers in R

I wonder if you could give me a hint on how to get past a problem I encountered when trying to extract data from HTML files. I looked through other questions on the issue but still cannot figure out exactly what changes I should make. I have five HTML files in a folder. From each of them, I want to extract the links (/item.asp?id=) that I will use later. At first I extracted this data without any trouble by reading every HTML file separately and creating a separate data frame for each file with the links I need. Then I used rbind to merge the columns from each data frame. The key here is that the first three HTML pages have 20 rows of the data I need, the fourth has 16 rows, and the fifth and last has 9 rows.
The looping code works just fine when I loop over the first three pages, which have 20 rows each, but it fails for the fourth and fifth HTML pages because the row number is different there. I get this error:
Error in [[<-.data.frame(*tmp*, i, value = c("/item.asp?id=22529120", : replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file <- ""
file.names <- dir(path, pattern = ".html")
for (i in 1:length(file.names))
{
  page <- read_html(file.names[i])
  links <- page %>% html_nodes("a") %>% html_attr("href")
  ## get all links into a dataframe
  df <- as.data.frame(links)
  ## get links which contain /item.asp
  page_article <- df[grep("/item.asp", df$links), ]
  ## for each HTML save a separate data frame with a links column
  java[i] <- as.data.frame(page_article)
  ## save the number of the page where this link is
  page_num[i] <- paste(toString(i))
  ## save the id of the person this page belongs to
  id[i] <- as.character(file.names[i])
}
Can anyone give me a bit of advice on how to solve this issue? If it works, I should end up with one column holding the links, another holding an id, and another holding the number of the HTML page.
Write a function which returns a dataframe after reading from each HTML file.
read_html_files <- function(filename) {
  page <- read_html(filename)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  page_article <- grep("/item.asp", links, value = TRUE)
  data.frame(filename, page_article)
}
Use purrr::map_df to apply this function to every file and combine the output into one dataframe (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result
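If you would rather not add purrr, roughly the same thing can be done in base R (a sketch reusing the read_html_files function above; the filename column already identifies which file each link came from):
dfs <- lapply(file.names, read_html_files)
result <- do.call(rbind, dfs)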

Filter part of the HTML page when scraping results with Scrapy

I want to scrape the products that are listed on this webpage, so I tried to extract all of the data-tcproduct attributes from div.product-tile. They contain numerous things, including the URL of each product I need to visit.
So I did:
def parse_brand(self, response):
    for d in set(response.css('div.product-tile::attr(data-tcproduct)').extract()):
        d = json.loads(d)
        yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)
Yet I noticed that some attributes from div.product-tile seem to be hidden in the page, and I am not interested in them. The ones I want to scrape are under product-listing-title instead.
So how can I filter part of the HTML page when scraping results with Scrapy?
I don't think that you need product-listing-title. You need the items from the search-result-content div instead:
for d in response.css('div.search-result-content div.product-tile::attr(data-tcproduct)').extract():
    d = json.loads(d)
    yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)

Table way too wide to fit in Markdown-generated PDF

I am trying to display a table from an SQL query in a PDF using R Markdown. However, the table I get is too wide and does not fit in the document.
I have been recommended the pander package, so I tried the pandoc.table() function, which works great in the console, but for some reason it stops my document from rendering in R Markdown.
The code looks something like this:
rz = dbSendQuery(mydb, "select result.id result_id, company.id company_id, (...)")
datz = fetch(rz, n=-1)
It is a very long query but, as I said, it works both in MySQL and in the R console (I am working in RStudio).
So, when I do
kable(datz, "latex", col.names = c(colnames(datz)), caption = paste('This is a sample table')) %>%
  kable_styling(latex_options = "striped") %>%
  column_spec(1, bold = T, color = "red")
the results that get printed are too wide to fit in the PDF.
I do not know how I can solve this. I tried pandoc.table() from the pander package, but the formatting of the result seems very limited compared to the options I have with kable.
You have to use the scale_down option from kableExtra. The scale_down option will shrink the table so it fits on the page when it is too wide; the font size is reduced as well.
Here is an example of the code you could use:
kable(your_dt, "latex", booktabs = T) %>%
  kable_styling(latex_options = c("striped", "scale_down"))
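Applied to the query result from the question, that would look roughly like this (a sketch; datz, the caption, and the column_spec styling are taken from the question):
library(kableExtra)
kable(datz, "latex", booktabs = T, caption = 'This is a sample table') %>%
  kable_styling(latex_options = c("striped", "scale_down")) %>%
  column_spec(1, bold = T, color = "red")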

Child number changes when web scraping different patents

I am relatively new to web scraping.
I am having problems with child numbers when scraping multiple patents. The child number changes according to the location of the table on the web page: sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
IPCs <- sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
  tryCatch(url1 %>%
             as.character() %>%
             read_html() %>%
             html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
             html_table(),
           error = function(e){NA}
  )
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents, so I need code that works for both child numbers. Is there a CSS selector which allows me to select more than one child? I have tried "div:nth-child(n)" and "div:nth-child(*)" but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try these pseudo-classes; chained together they select the range of children between 17 and 18:
div:nth-child(n+17):nth-child(-n+18)
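Plugged into the code from the question, that looks roughly like this (a sketch: patent_urls just collects the two example URLs from the question, and it assumes the table you want always sits in child 17 or 18):
patent_urls <- c("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html",
                 "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html")
IPCs <- sapply(patent_urls, function(url1){
  tryCatch(url1 %>%
             as.character() %>%
             read_html() %>%
             html_nodes("#inner_content2 > div:nth-child(n+17):nth-child(-n+18) > div.disp_elm_value3 > table") %>%
             html_table(),
           error = function(e){NA}
  )
})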