I would first like to preface this question by saying I am new to R and would really appreciate any help this community can provide:
Context: I am trying to make this loop go through and read the first data table from a website (https://www.footballdb.com/college-football/teams/fbs/oklahoma/stats/2010.html) for each of the years 2010-2019, and then save that data in its own individual table. However, I have been unable to do so, as I don't think I have a firm understanding of how to properly use the loop itself.
My current code:
library(rvest)

tables <- vector("list", length(2010:2019))
names(tables) <- 2010:2019

for (i in 2010:2019) {
  pageurl <- paste0("https://www.footballdb.com/college-football/teams/fbs/oklahoma/stats/", i, ".html")
  content <- read_html(pageurl)
  page_tables <- content %>% html_table(fill = TRUE)   # every table on the page
  passing <- page_tables[[1]]                          # first (passing) table only
  passing$year <- i                                    # tag the season
  tables[[as.character(i)]] <- passing                 # one table per year
}
Any help will be greatly appreciated!
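If the per-year tables later need to be stacked into one data frame, a short follow-up sketch (an addition here, assuming the dplyr package is available and that the yearly tables have compatible columns) could be:

library(dplyr)

# Stack the yearly tables from the loop above; the year column keeps the seasons apart
all_years <- bind_rows(tables)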
I have been trying to scrape data from a real estate website using R's rvest package. The website from which I am attempting to scrape listing prices has 15 pages with 631 total listings. However, when I use the script that follows, it results in a data frame with only a little over 360 values (it seems to take listing prices from the first 9 pages and then stop). Additionally, when I try the exact same script right after the first run, it replaces the previous data frame with 0 values. If I wait 30 minutes and use the same code again, I get the original data frame with ~369 values again. I will include my code below:
library(rvest)
library(purrr)
library(httr)
library(stringr)
library(readr)   # parse_number() comes from readr
url <- "https://www.realtor.com/soldhomeprices/Boulder_CO/type-single-family-home,multi-family-home/pg%d"
boulder_sold <- map_df(1:15, function(i) {
  pg <- read_html(sprintf(url, i))
  data.frame(Price = parse_number(html_text(html_nodes(pg, ".data-price"))),
             stringsAsFactors = FALSE)
})
I thought that perhaps my problem was that the website was timing out and kicking me off, so I also tried another iteration with a for-loop to try to give breaks in between reading groups of pages. The script for this was:
boulder_sold_break <- map_df(1:15, function(i) {
  for (j in i) {
    Sys.sleep(5)
    if ((i %% 2) == 0) {
      message("taking a break")
      Sys.sleep(2)
    }
  }
  pg <- read_html(sprintf(url, i))
  data.frame(Price = parse_number(html_text(html_nodes(pg, ".data-price"))),
             stringsAsFactors = FALSE)
})
Therefore, could anyone tell me: 1) Why will my code not give me a data frame with all 631 listing prices? 2) Why does the same script stop giving me any list prices after an initial attempt (and then go back to outputting results after a certain period of time)?
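One thing worth checking (a diagnostic sketch added here, not a confirmed explanation; it reuses the url template defined above and assumes httr is loaded) is whether the site starts refusing requests partway through, which would show up in the HTTP status codes:

# Fetch each page and record its HTTP status code:
# 200 means OK, while 403 or 429 would point to blocking or rate limiting
statuses <- vapply(1:15, function(i) {
  resp <- GET(sprintf(url, i))
  Sys.sleep(2)   # small pause between requests
  status_code(resp)
}, integer(1))
statuses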
I've been trying to webscrape data from a specific website using selectorgadget in R. For example, I successfully webscraped from http://www.dotabuff.com/heroes/abaddon/matchups before. Usually, I just click on the tables I want using the selectorgadget Chrome extension and put the CSS Selection result into the code as follows.
urlx <- "http://www.dotabuff.com/heroes/abaddon/matchups"
rawData <- html_text(html_nodes(read_html(urlx),"td:nth-child(4) , td:nth-child(3), .cell-xlarge"))
In this case, the html_nodes function does return a whole bunch of nodes (340)
{xml_nodeset (340)}
However, when I try to webscrape off http://www.dotapicker.com/heroes/Abaddon using selectorgadget, which turns out to be this code:
urlx <- "http://www.dotapicker.com/heroes/abaddon"
rawData <- html_text(html_nodes(read_html(urlx),".ng-scope:nth-child(1) .ng-scope .ng-binding"))
Unfortunately, no nodes actually show up after the html_nodes function is called, and I get the result
{xml_nodeset (0)}
I feel like this has something to do with the nesting of the table in a drop down box (compared to previously, the table was right on the webpage itself) but I'm not sure how to get around it.
Thank you and I appreciate any help!
It seems like this page loads some data dynamically using XHR. In Chrome you can check that by opening the developer tools (Inspect) and looking at the Network tab. If you do, you will see that a number of JSON files are being loaded. You can scrape those JSON files directly and then parse them to extract the info you need. Here is a quick example:
library(httr)
library(jsonlite)
heroinfo_json <- GET("http://www.dotapicker.com/assets/json/data/heroinfo.json")
heroinfo_flat <- fromJSON(content(heroinfo_json, type = "text"))
#> No encoding supplied: defaulting to UTF-8.
winrates_json <- GET("http://www.dotapicker.com/assets/dynamic/winrates10d.json")
winrates_flat <- fromJSON(content(winrates_json, type = "text"))
#> No encoding supplied: defaulting to UTF-8.
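From there, a quick way to see what came back (just a usage note; the exact layout of these JSON files is not documented here and may change) is to inspect the top level of each parsed object:

# Show only the top-level components of each parsed JSON object
str(heroinfo_flat, max.level = 1)
str(winrates_flat, max.level = 1)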
I would like to read all html tables containing Federer's results from this website: http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity
and store the data in one single data frame. One way I figured out was using the rvest package, but as you may notice, my code only works for a specific number of tournaments. Is there any way I can read all relevant tables with one command? Thank you for your help!
Url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
x<- list(length(4))
for (i in 1:4) {
results <- Url %>%
read_html() %>%
html_nodes(xpath=paste0("//table[#class='mega-table'][", i, "]")) %>%
html_table()
results <- results[[1]]
x[[i]] <- resultados
}
Your solution above was close to being the final solution. One downside of your code was having the read_html statement inside the for loop, which greatly slows down processing. In the future, read the page into a variable once and then process the page node by node as necessary.
In this solution, I read the web page into the variable page and then extracted the table nodes where class = "mega-table". From there, the html_table command returns a list of the tables of interest, and do.call loops rbind over that list to bind the tables together.
library(rvest)

url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
page <- read_html(url)
tablenodes <- html_nodes(page, "table.mega-table")
tables <- html_table(tablenodes)
# numoftables <- length(tables)
df <- do.call(rbind, tables)
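As a side note (an assumption here, not something verified against the live page): if the individual tables did not share identical column names, do.call(rbind, ...) would fail, whereas dplyr::bind_rows fills any missing columns with NA:

library(dplyr)

# More forgiving alternative to do.call(rbind, tables)
df <- bind_rows(tables)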
I am trying to replicate the method used in a previous answer here (Scraping html tables into R data frames using the XML package) for my own work, but I cannot get the data to extract. The website I am using is:
http://www.footballfanalytics.com/articles/football/euro_super_league_table.html
I just wish to extract a table of each team name and their current rating score. My code is as follows:
library(XML)
theurl <- "http://www.footballfanalytics.com/articles/football/euro_super_league_table.html"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
This produces the error message
Error in tables[[which.max(n.rows)]] :
attempt to select less than one element
Could anyone suggest a solution please? Is there something in this particular site causing this not to work? Or is there a better alternative method I can try? Thanks
It seems as if the data is loaded via JavaScript. Try:
library(XML)
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
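Since cbind on character vectors returns a character matrix, a small follow-up sketch (an addition here; it assumes the Points field is numeric) converts the result into a proper data frame:

esl <- data.frame(
  team   = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
  points = as.numeric(xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue)),
  stringsAsFactors = FALSE
)
head(esl)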
G'day Everyone,
I am looking for a raster layer for human population/habitation in Australia. I have tried finding some free datasets online but couldn't really find anything in a useful format. I thought it might be interesting to try to scrape population data from Wikipedia and make my own raster layer. To this end I have tried getting the info from the wiki pages, but not knowing anything about HTML has not helped me.
The idea is to supply a list of all the towns in Australia that have wiki pages and extract the appropriate data into a data.frame.
I can get the webpage source data into R, but am stuck on how to extract the particular data that I want. The code below shows where I am stuck, any help would be really appreciated or some hints in the right direction.
I thought I might be able to use readHTMLTable() because, in the normal webpage, the info I want is off to the right in a nice table. But when I use this function I get an error (below). Is there any way I can specify this table when I am getting the source info?
Sorry if this question doesn't make much sense, I don't have any idea what I am doing when it comes to searching HTML files.
Thanks for your help, it is greatly appreciated!
Cheers,
Adam
require(XML)   # htmlParse() and readHTMLTable() come from the XML package
loc.names <- data.frame(town = c('Sale', 'Bendigo'), state = c('Victoria', 'Victoria'))
u <- paste('http://en.wikipedia.org/wiki/',
           sep = '', loc.names[,1], ',_', loc.names[,2])
res <- lapply(u, function(x) htmlParse(x))
Error when I use readHTMLTable:
tabs <- readHTMLTable(res[1])
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"list"’
For instance, some of the data I need looks like this in the HTML source. My question is: how do I specify these locations in the HTML I have?
/ <span class="geo">-38.100; 147.067
title="Victoria (Australia)">Victoria</a>. It has a population (2011) of 13,186
res is a list, so in this case you need to use res[[1]] rather than res[1] to access its elements.
Using readHTMLTable on these elements will give you all of the tables. The geo information is contained in a table with class = "infobox vcard", so you can extract those tables separately and then pass them to readHTMLTable:
require(XML)
lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
       readHTMLTable)
If you are not familiar with XPath, the selectr package allows you to use CSS selectors, which may be easier.
require(selectr)
> querySelectorAll(res[[1]], "table span .geo")
[[1]]
<span class="geo">-38.100; 147.067</span>
[[2]]
<span class="geo">-38.100; 147.067</span>