R: Two Different Methods of Web Scraping Produce Two Different Results?

I am trying to scrape the name, address and longitude/latitude coordinates for each name on a website (e.g. www.mywebsite.com). Based on this SO post, I used the following code to get the name and address:
library(tidyverse)
library(rvest)
library(httr)
library(XML)
# Define function to scrape 1 page
get_info <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("mywebsite.com", page_n, "?extension") %>%
    read_html()
  tibble(
    title  = page %>% html_elements(".title a") %>% html_text2(),
    adress = page %>% html_elements(".marker") %>% html_text2(),
    page   = page_n
  )
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Since I did not know how to modify the above code to extract the coordinates, I wrote a separate script to scrape them:
# Recognize pattern in websites
part1 = "www.mywebsite.com"
part2 = c(0:55)
part3 = "?extension"
temp = data.frame(part1, part2, part3)
# Create list of websites
temp$all_websites = paste0(temp$part1, temp$part2, temp$part3)
# Scrape
df_2 <- list()
for (i in 1:10) {
  tryCatch({
    url_i <- temp$all_websites[i]
    page_i <- read_html(url_i)
    b_i <- page_i %>% html_nodes("head")
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
    df_2[[i]] <- listanswer_i
    print(listanswer_i)
  }, error = function(e) {})
}
# Extract long/lat from results
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 = data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
In the end, scraping the first 10 pages for name/address resulted in 90 entries, but scraping the same 10 pages for the longitude/latitude resulted in 96 entries:
dim(df_1)
[1] 90
dim(df_2)
[1] 96 3
Can someone please help me understand why this is happening and what I can do to fix it?
In the end, I would like to make a final table (using df_1 and df_2) that looks something like this:
id name address long lat
1 1 name1 address1 long1 lat1
2 2 name2 address2 long2 lat2
3 3 name3 address3 long3 lat3
Thanks!
Note: I understand that it's possible some names might be missing their latitude/longitude, and that it might not be possible for the dimensions of "df_1" to match the dimensions of "df_2". If this is the case, would it somehow be possible to find out which names are missing their latitude/longitude (e.g. replace the latitude/longitude entries with NA for those cases)? For example, suppose the latitude/longitude was not available for "name3":
id name address long lat
1 1 name1 address1 long1 lat1
2 2 name2 address2 long2 lat2
3 3 name3 address3 NA NA

The Problem
The problem is that your second code snippet is not filtering out strings that contain "LatLng" but do not provide coordinates.
After your second code snippet finishes scraping the pages, you do the following:
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
If you look at the output of this with print(lat_long), you would see that most rows contain coordinates. However, you'd also see rows containing the string "\t\t\t\tvar bounds = new google.maps.LatLngBounds();". If you go back to the raw HTML you grabbed, you'd see this line appears once per page. Accordingly, you need to remove these rows.
I thought that perhaps you accomplished this with the remaining code, but you never actually remove those rows. For example, the code below just produces an object filled with NA values; I don't think it does what you want:
as.numeric(gsub("([0-9]+).*$", "\\1", lat_long))
Additionally, the code below retains those rows as well:
data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
The Solution
You need to drop the elements without coordinates. You'll notice that those elements all contain the substring "LatLngBounds();", so you can filter them out once they're in a data.frame, as below, or drop them with a regex first.
df_2 %>% filter(X1 != "LatLngBounds();")
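For the regex route, a minimal sketch assuming the lat_long vector built above:
# Drop the LatLngBounds() initialiser lines before extracting coordinates
lat_long <- lat_long[!grepl("LatLngBounds\\(\\)", lat_long)]
df_2 <- data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))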
Note that this will actually produce 86 rows instead of 90, so now we're short 4 rows. This is because you are not collecting the GPS coordinates for everyone on the provider pages. You can tell because every provider has an address in df_1, and the site's coordinates are simply the result of passing those addresses to the Maps API.
Why aren't you getting all of the coordinates? My guess is there are two reasons. First, you are scraping coordinates based on the marker entries, which correspond to markers/pins on the map. Since the number of pins on the map need not equal the number of providers on the page, you will miss some providers. The less likely issue may have to do with the Google Maps API. If you visit the URLs you create to scrape from (example), you'll see in the bottom left that the Google Maps widget shows the error "This page didn't load Google Maps correctly. See the JavaScript console for technical details". If you look at the JS console, you'll see that an invalid Google Maps API key was provided. This seems plausible since (a) there is one "LatLngBounds" row per page you are scraping and (b) the row after each of those rows contains coordinates that are not necessarily anywhere near the providers (mine initialize on the U.S. West Coast while the providers are in Canada). I don't know whether this has any real consequence, but it would explain things if the marker issue isn't the driver.
However, all of this is mostly irrelevant since you don't even need to scrape the coordinates in the first place. You have a list of addresses: you can geocode them yourself! There are different ways of doing this, but you can replicate what the site is doing by simply passing them to the Google Maps API. For step-by-step instructions on how to do this, see here.
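Since df_1 already holds every provider's address, here is one minimal sketch of geocoding it directly, assuming the tidygeocoder package and the adress column (spelled as in df_1 above). The "osm" backend is a free alternative to the Google Maps API, which would require a key:
library(tidygeocoder)

# Append lat/long columns to df_1 by geocoding each address.
# Addresses that cannot be resolved come back as NA.
df_final <- df_1 %>%
  geocode(address = adress, method = "osm")
Addresses that fail to geocode come back with NA coordinates, which matches the NA handling asked for in the question, so df_final already has the shape of the desired name/address/long/lat table.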
Identifying the Problem
To provide a better idea of how to approach similar problems in the future, I'll show how I worked through this. One way to approach issues like this is to start by ruling out possible explanations.
Why the problem isn't "missing coordinates"
If the issue were that some names are missing coordinates, we would expect nrow(df_1) > nrow(df_2). However, you reported the opposite: nrow(df_2) > nrow(df_1).
Why the problem isn't the first code snippet
Since each page contains 9 providers (at least until the last page) and you are scraping 10 pages, we expect 9 * 10 = 90 elements. As you noted, the first code snippet returns an object with 90 rows while the second returns an object with 96 rows, so the second code snippet must be the issue.
Why the problem isn't the pages
Looking at your code, I noticed that you're scraping different pages. Your code to produce df_1 iterates over the values of page_n in the interval 1:10. In contrast, your code to produce df_2 iterates over the values of page_n in the interval 0:9. This is because the latter code extracts the values of all_websites at indices 1:10, which correspond to pages 0:9, since the page numbers in all_websites come from the vector 0:55. Since page_n == 0 returns the same page as page_n == 1, your first script is scraping pages 1:10 while your second is scraping pages c(1, 1:9). This means that the values contained in df_1 and df_2 will differ.
However, this cannot explain the discrepancy in the dimensions of the two objects, since both would still be expected to return 90 rows!
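If you want both scripts to visit exactly the same pages, a minimal sketch (keeping the question's placeholder URL) is to build the page vector from 1:10 rather than 0:55:
# Build the URL table from pages 1:10 so it matches get_info(1:10)
temp <- data.frame(part1 = "www.mywebsite.com",
                   part2 = 1:10,
                   part3 = "?extension")
temp$all_websites <- paste0(temp$part1, temp$part2, temp$part3)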

Related

R web scraping difficulty--Why can't I get all of the listing prices from a multi-page website?

I have been trying to scrape data from a real estate website using R's rvest package. The website I am attempting to scrape listing prices from has 15 pages with 631 total listings. However, when I use the script that follows, it results in a data frame with only a little over 360 values (it seems to take listing prices from the first 9 pages and then stop). Additionally, when I try using the exact same script right after the first try, it replaces the previous data frame with 0 values. If I wait 30 minutes and use the same code again, I get the original data frame with ~369 values again. I will include my code below:
library(rvest)
library(purrr)
library(httr)
library(stringr)
library(readr)  # parse_number() used below
url <- "https://www.realtor.com/soldhomeprices/Boulder_CO/type-single-family-home,multi-family-home/pg%d"
boulder_sold <- map_df(1:15, function(i) {
  pg <- read_html(sprintf(url, i))
  data.frame(Price = parse_number(html_text(html_nodes(pg, ".data-price"))),
             stringsAsFactors = FALSE)
})
I thought that perhaps my problem was that the website was timing out and kicking me off, so I also tried another iteration with a for-loop to try to give breaks in between reading groups of pages. The script for this was:
boulder_sold_break <- map_df(1:15, function(i) {
  for (j in i) {
    Sys.sleep(5)
    if ((i %% 2) == 0) {
      message("taking a break")
      Sys.sleep(2)
    }
  }
  pg <- read_html(sprintf(url, i))
  data.frame(Price = parse_number(html_text(html_nodes(pg, ".data-price"))),
             stringsAsFactors = FALSE)
})
Therefore, could anyone tell me: 1) Why will my code not give me a data frame with all 631 listing prices? 2) Why does the same script stop giving me any listing prices after an initial attempt (and then go back to outputting results after a certain period of time)?
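One way to narrow this down (a hedged diagnostic sketch reusing the url template and the ".data-price" selector from the question, not a definitive answer) is to count how many price nodes each page actually returns, which shows whether the site stops serving listings after page 9 or starts rejecting rapid requests:
library(rvest)
library(purrr)

# Count .data-price nodes per page; pages returning 0 point to where the
# site stops serving results or begins rate-limiting the scraper.
counts <- map_int(1:15, function(i) {
  Sys.sleep(2)  # space requests out to reduce the chance of being blocked
  pg <- read_html(sprintf(url, i))
  length(html_nodes(pg, ".data-price"))
})
counts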

How to deal with missing row when binding column to data frame (a scraping issue!)

I'm attempting to create data frames by attaching URLs to a scraped HTML table, and then writing these to individual csv files. The data are concerned with the passage of Bills through their respective stages in both the House of Commons and Lords. I've written a function (see below) which reads the tables, parses the HTML code, scrapes the URLS required, binds the two together, extracts the rows concerned with the House of Lords, and then writes the csv files. This function is then run across two lists (one of links to the Bill stage page and another of simplified file names).
library(XML)
lords_tables <- function(x, y) {
  tables <- as.data.frame(readHTMLTable(x))
  sitePage <- htmlParse(x)  # This parses the web code
  hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
                       xmlGetAttr, 'href')  # First href child of the a nodes
  table_bind <- cbind(tables, hrefs)
  row_no <- grep(".+: House of Lords|Royal Assent",
                 table_bind$NULL.V2)  # Row positions of Lords/Royal Assent
  lords_rows <- table_bind[grep(".+: House of Lords|Royal Assent",
                                table_bind$NULL.V2), ]  # Subset those rows
  write.csv(lords_rows, file = paste0(y, ".csv"))
}
# x = a list of links to the Bill pages/ y = list of simplified names
mapply(lords_tables, x=link_list, y=gsub_URL)
This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:
browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")
For this example, no debate occurred at the '3rd reading: House of Commons' and again at the 'Royal Assent'. This results in the following error being returned:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 21, 19
To get around this error, I'd like to have an NA against the missing stage. Has anyone got an idea of how to achieve this? I'm a relative n00b, so feel free to suggest a more elegant approach to the whole problem.
Thanks in advance!
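One possible direction (a hedged sketch, not a tested fix): collect one href per table row instead of one per <td>-descendant anchor, returning NA for rows without a link so that cbind() always receives vectors of equal length. The "(//table)[1]" XPath is an assumption; point it at whichever table readHTMLTable actually picks up.
library(XML)

lords_tables_na <- function(x, y) {
  sitePage <- htmlParse(x)
  tables   <- as.data.frame(readHTMLTable(sitePage))
  # One entry per data row of the (assumed first) stages table
  rows  <- getNodeSet(sitePage, "(//table)[1]//tr[td]")
  hrefs <- sapply(rows, function(r) {
    a <- xpathSApply(r, ".//a[1]", xmlGetAttr, "href")
    if (length(a) == 0) NA_character_ else a[[1]]  # NA when the row has no link
  })
  table_bind <- cbind(tables, hrefs)
  lords_rows <- table_bind[grep(".+: House of Lords|Royal Assent",
                                table_bind$NULL.V2), ]
  write.csv(lords_rows, file = paste0(y, ".csv"))
}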

How to make a complex list into a dataframe in R?

I have a complex list that I got from a JSON file.
The JSON file comes from a map service API in China.
I searched the site but couldn't find a proper solution, so I am asking here and hope it can be solved.
If I missed something that is already covered on the site, I apologize for that.
The code to get the list is as follows:
library(rjson)
library(RCurl)
key <- "fd5a14632c36aecd2e759a0cc91a3b4a"
origin <- "大润发东环店"
urlorigin <- paste("http://restapi.amap.com/v3/geocode/geo?key=", key,
                   "&address=", origin, "&city=苏州", sep = "")
dataorigin <- readLines(urlorigin, encoding = "UTF-8")
origininfo <- fromJSON(dataorigin)
originpoi <- origininfo$geocodes[[1]]$location
destination <- "苏州大学本部北门"
urldest <- paste("http://restapi.amap.com/v3/geocode/geo?key=", key,
                 "&address=", destination, "&city=苏州", sep = "")
datadest <- readLines(urldest, encoding = "UTF-8")
destinfo <- fromJSON(datadest)
destpoi <- destinfo$geocodes[[1]]$location
urlpath <- paste("http://restapi.amap.com/v3/direction/driving?key=", key,
                 "&origin=", originpoi, "&destination=", destpoi,
                 "&originid=&destinationid=&extensions=all&strategy=0&waypoints=&avoidpolygons=&avoidroad=",
                 sep = "")
pathjson <- paste(readLines(urlpath, encoding = "UTF-8"), collapse = "")
pathinfo <- fromJSON(pathjson)
pathinfo is the list I end up with, and I want to convert it into a data frame that I can work with.
Thank you for your time.
I'm from China and my English is not that good; I apologize for that.
My Chinese is very limited as well. But your code to get the data is working (with some warnings).
pathinfo_df <- as.data.frame(lapply(pathinfo,rbind))
pathinfo_df is now a data_frame.
summary(pathinfo_df)
status info infocode count
1:1 OK:1 10000:1 1:1
route.origin.Length route.origin.Class route.origin.Mode
1 -none- character
route.destination.Length route.destination.Class route.destination.Mode
1 -none- character
route.taxi_cost.Length route.taxi_cost.Class route.taxi_cost.Mode
1 -none- character
route.paths.Length route.paths.Class route.paths.Mode
1 -none- list
So, there's plenty to select and play with. Read up on selecting from lists; see also:
str(pathinfo_df)
Then map it on Google Earth. Looks like the taxi might be costly. Have a good trip!
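If the goal is a data frame of the driving steps, here is a hedged sketch. It assumes the response follows AMap's documented driving shape (route$paths[[1]]$steps, with step fields such as instruction, road, distance and duration); check str(pathinfo) and adjust the field names if yours differ. Empty fields are mapped to NA:
# Flatten the steps of the first returned path into a data frame.
steps <- pathinfo$route$paths[[1]]$steps

steps_df <- do.call(rbind, lapply(steps, function(s) {
  pick <- function(field) {
    v <- s[[field]]
    if (is.null(v) || length(v) == 0) NA_character_ else as.character(v)
  }
  data.frame(instruction = pick("instruction"),
             road        = pick("road"),
             distance    = pick("distance"),
             duration    = pick("duration"),
             stringsAsFactors = FALSE)
}))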

R: parse JSON/XML exported compound properties from Pubchem

I would like to parse all chemical properties of a given compound as given in Pubchem in R, using the JSON (or XML) export facility.
Example: ALPHA-IONONE, pubchem compound ID 5282108
https://pubchem.ncbi.nlm.nih.gov/compound/5282108
library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
or
library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
will get me a tree of nested lists, but how do I go from this rather complicated list of nested lists to a nice dataframe or list of dataframes?
In this case, what I am after is everything under
3.1 Computed Descriptors
3.2 Other Identifiers
3.3 Synonyms
4.1 Computed Properties
in a single row of a dataframe and each element in a separate named column with multiple items per element (e.g. multiple synonyms) pasted together with a "|" as a delimiter. E.g. in this case something like
pubchemid IUPAC_Name InChI InChI_Key Canonical SMILES Isomeric SMILES CAS EC Number Wikipedia MeSH Synonyms Depositor-Supplied Synonyms Molecular_Weight Molecular_Formula XLogP3 Hydrogen_Bond_Donor_Count ...
5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....
Fields with multiple items, such as Depositor-Supplied Synonyms could be pasted together with a "|", e.g. value could be ALPHA-IONONE|Iraldeine|...
Second, I would also like to import section
4.2.2 Kovats Retention Index
as a dataframe
pubchemid column_class kovats_ri
5282108 Standard non-polar 1413
5282108 Standard non-polar 1417
...
5282108 Semi-standard non-polar 1427
...
(section 4.3.1 GC-MS would have been nice too, but since it only displays the 3 top peaks this is a little useless right now, so I'll skip that)
Anybody any idea how to achieve this in an elegant way?
PS Note that not all these fields will necessarily exist for any given query.
2D structure and some properties can also be obtained from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display
and 3D structure from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display
Data can also be exported as XML, using
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display
if that would be any easier
Note: also tried with R package rpubchem, but that one only seems to import a small amount of the available info:
library("rpubchem")
get.cid(5282108)
CID IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one C13H20O 192.297300 0 3 0 1 14 17 5282108
My proposal works on XML files, because (thanks to XPath) I find them more convenient to traverse and select nodes.
Please note that this is neither fast (it took a few seconds while testing) nor optimal (I parse each file twice: once for names and the like, and once for the Kovats Retention Index). But I guess that you will want to parse a set of files once and then go ahead with your real business, and premature optimization is the root of all evil.
I have put the main tasks into separate functions. If you want to get data for one specific pubchem record, they are ready to use. But if you want to get data from a few pubchem records at once, you can define a vector of pointers to the data and use the examples at the bottom to merge the results together. In my case, the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them (remember that each site will be requested twice, and with a greater number of records you probably want to handle network failures somehow).
The compound you linked to has multiple entries for "EC Number". They differ by ReferenceNumber, but not by Name. I wasn't sure why that is or what I should do with it (your sample output contains only one entry for EC Number), so I left this to R. R added suffixes to the duplicated values and created EC.Number.1, EC.Number.2, etc. These suffixes do not match the ReferenceNumber in the file, and the same column in the master data frame will probably refer to different ReferenceNumbers for different compounds.
It seems that pubchem uses the following format for tags: <type>Value[List]. In a few places I have hardcoded StringValue, but some compounds may use different types in the same fields. I generally haven't handled lists, except where it was requested, so further modifications might be needed as more data is thrown at this code.
If you have any questions, please post them in comments. I am not sure whether I should explain that code or what.
library("xml2")
library("data.table")
# Pull the identifier/property sections into one row per compound
compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)
  properties <- sapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    # Lists (e.g. synonyms) are collapsed with "|"; scalars are taken as-is
    value <- ifelse(length(xml_find_all(x, "./d1:StringValueList", ns)) > 0,
                    paste(sapply(
                      xml_find_all(x, "./d1:StringValueList", ns),
                      xml_text, trim=TRUE), sep="", collapse="|"),
                    xml_text(
                      xml_find_one(x, "./*[contains(name(),'Value')]", ns),
                      trim=TRUE)
    )
    names(value) <- name
    return(value)
  })
  rm(compound, information)
  properties <- as.list(properties)
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  return(data.frame(properties))
}

# Pull the Kovats Retention Index section into a long data frame
compound.retention.index <- function(file=NULL) {
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)
  indexes <- lapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    values <- as.numeric(sapply(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns),
      xml_text))
    data.frame(pubchemid = pubchemid,
               column_class = name,
               kovats_ri = values)
  })
  return(do.call("rbind", indexes))
}

compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")

cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill = TRUE
)

rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))
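As mentioned above, URLs work as pointers too; a small sketch pointing the same functions at the pug_view XML export for the same compound IDs (keeping in mind that each URL will be fetched twice):
# Build pug_view XML URLs and reuse the two functions unchanged.
compounds <- paste0(
  "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
  c(5282108, 5282148, 91754124),
  "/XML/?response_type=display"
)
cd  <- rbindlist(lapply(compounds, compound.attributes), fill = TRUE)
rti <- do.call("rbind", lapply(compounds, compound.retention.index))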

Extracting population data from website; wiki town webpages

G'day Everyone,
I am looking for a raster layer of human population/habitation in Australia. I have tried finding some free datasets online but couldn't really find anything in a useful format. I thought it might be interesting to try scraping population data from Wikipedia and making my own raster layer. To this end I have tried getting the info from the wiki pages, but not knowing anything about HTML has not helped me.
The idea is to supply a list of all the towns in Australia that have wiki pages and extract the appropriate data into a data.frame.
I can get the webpage source data into R, but am stuck on how to extract the particular data that I want. The code below shows where I am stuck, any help would be really appreciated or some hints in the right direction.
I thought I might be able to use readHTMLTable() because, in the normal webpage, the info I want is off to the right in a nice table. But when I use this function I get an error (below). Is there any way I can specify this table when I am getting the source info?
Sorry if this question doesn't make much sense, I don't have any idea what I am doing when it comes to searching HTML files.
Thanks for your help, it is greatly appreciated!
Cheers,
Adam
require(RJSONIO)
require(XML)  # htmlParse() and readHTMLTable() come from XML
loc.names <- data.frame(town = c('Sale', 'Bendigo'), state = c('Victoria', 'Victoria'))
u <- paste('http://en.wikipedia.org/wiki/', sep = '',
           loc.names[, 1], ',_', loc.names[, 2])
res <- lapply(u, function(x) htmlParse(x))
Error when I use readHTMLTable:
tabs <- readHTMLTable(res[1])
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"list"’
For instance, some of the data I need looks like this in the HTML source. My question is: how do I specify these locations in the HTML I have?
/ <span class="geo">-38.100; 147.067
title="Victoria (Australia)">Victoria</a>. It has a population (2011) of 13,186
res is a list, so in this case you need to use res[[1]] rather than res[1] to access its elements.
Using readHTMLTable on these elements will give you all the tables. The geo info is contained in a table with class = "infobox vcard"; you can extract these tables separately and then pass them to readHTMLTable.
require(XML)
lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
       readHTMLTable)
If you are not familiar with XPath, the selectr package allows you to use CSS selectors, which may be easier.
require(selectr)
> querySelectorAll(res[[1]], "table span .geo")
[[1]]
<span class="geo">-38.100; 147.067</span>
[[2]]
<span class="geo">-38.100; 147.067</span>
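To turn those spans into usable numbers, a minimal sketch (assuming the "lat; lon" format shown above and the loc.names and res objects from the question):
require(XML)
require(selectr)

# Take the first .geo span of each parsed page and split "lat; lon" into
# numeric columns next to the town names.
geo_df <- do.call(rbind, lapply(seq_along(res), function(i) {
  geo <- querySelectorAll(res[[i]], "table span .geo")
  ll  <- as.numeric(strsplit(xmlValue(geo[[1]]), ";\\s*")[[1]])
  data.frame(town = loc.names$town[i], lat = ll[1], lon = ll[2])
}))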