Loop Returning Indices Instead of Actual Results?

I have a list of URLs in R, stored in a list object called "url".
I also have the loop below, which I am using for webscraping. In this loop, if a specific URL takes longer than 1 second to scrape, the loop automatically moves on to the next URL:
library(R.utils)
library(jsonlite)

res <- vector("list", length = 10L)
for (i in 1:10) {
  print(i)
  tryCatch({
    res[[i]] <- withTimeout({
      url_i <- url[i]
      r_i <- data.frame(fromJSON(url_i))
      res[[i]] <- r_i
      print(i)
    }, timeout = 1)
  }, TimeoutException = function(ex) {
    message("Timeout. Skipping.")
    res[[i]] <- NULL
  })
}
This loop runs, but it seems to be storing only the loop indices instead of the actual webscraping results. For example:
> res
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3
Have I written this loop wrong? And what can I do to fix this?
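For reference, a likely cause, offered as an editorial note rather than a confirmed diagnosis: withTimeout() returns the value of the last expression in its block, and here that last expression is print(i), which returns i. The outer res[[i]] <- withTimeout(...) then overwrites the inner res[[i]] <- r_i with the index. A minimal sketch of a fix is to make the data frame the block's last expression:

library(R.utils)
library(jsonlite)

res <- vector("list", length = 10L)
for (i in 1:10) {
  tryCatch({
    res[[i]] <- withTimeout({
      # The data frame is now the last expression, so it is what
      # withTimeout() returns and what gets stored in res[[i]].
      data.frame(fromJSON(url[i]))
    }, timeout = 1)
  }, TimeoutException = function(ex) {
    message("Timeout. Skipping.")
  })
}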

Related

(R) Webscraping Error : arguments imply differing number of rows: 1, 0

I am working with the R programming language.
In a previous question (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to scrape the names and address of Pizza Stores from YellowPages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada). Here is the code for how to scrape a single page:
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html()
  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
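For instance, calling it on the single page from the question's example URL would look like this (a usage sketch):

scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")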
I then tried to write a loop that repeats this for all 391 pages:
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results = list()
for (i in 1:391)
{
url_i = paste0(a,i,b)
s_i = data.frame(scraper(url_i))
ss_i = data.frame(i,s_i)
print(ss_i)
list_results[[i]] <- ss_i
}
final = do.call(rbind.data.frame, list_results)
My Problem: I noticed that after the 60th page, I get the following error:
Error in data.frame(i, s_i) :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
closing unused connection
To investigate, I went to the 60th page (https://www.yellowpages.ca/search/si/60/pizza/Canada) and noticed that you cannot click beyond this page.
My Question: Is there something I can do differently to try and move past the 60th page, or is there some internal limitation within YellowPages that is preventing me from scraping further?
Thanks!
This is a limit on the YellowPages site that prevents you from continuing to the next page. A solution is to assign the return value of scraper and check the number of rows; if it is 0, break out of the for loop.
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results <- list()
for (i in 1:391) {
url_i = paste0(a,i,b)
s <- scraper(url_i, i)
message(paste("page number:", i, "\trows:", nrow(s)))
if(nrow(s) > 0L) {
s_i <- as.data.frame(s)
ss_i <- data.frame(i, s_i)
} else {
message("empty page, bailing out...")
break
}
list_results[[i]] <- ss_i
}
final <- do.call(rbind.data.frame, list_results)
dim(final)
# [1] 2100 3
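As a defensive variation (an editorial sketch, assuming occasional pages may fail with an HTTP error rather than returning zero rows), the scraper() call could be wrapped in tryCatch() so that "errored" and "empty" are handled differently:

for (i in 1:391) {
  url_i <- paste0(a, i, b)
  s <- tryCatch(scraper(url_i), error = function(e) {
    message("page ", i, " failed: ", conditionMessage(e))
    NULL
  })
  if (is.null(s)) next       # skip a page that errored, keep going
  if (nrow(s) == 0L) break   # an empty page still means we are done
  list_results[[i]] <- data.frame(i, as.data.frame(s))
}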

R - Issue with the DOM of the Danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all the legislative documents to its website. I've been crawling over all pages starting from 2008. Right now I'm parsing the information into a dataframe, and I have an issue that I have not been able to resolve so far.
If we look at the DOM, we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
library(RCurl)
library(rvest)

final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that.
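For reference, an editorial note ahead of the answers below: grep() returns every matching element, so keeping only the first match is a matter of subsetting its result. A minimal sketch, assuming the node texts live in normal as above:

# All elements of `normal` that match any of the patterns...
hits <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)
# ...and only the first one:
type <- hits[1]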
My understanding is that you want to extract the text of the webpage, since the "tingdok-normal" objects hold that text. I was able to get the text of the webpage with the following code, which also identifies the position of the first "regex hit" for each of the patterns to match.
library(pagedown)
library(pdftools)
library(stringr)

pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex.
  # To locate all regex hits, use stringr::str_locate_all.
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Here is another approach:
library(RDCOMClient)
library(stringr)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex.
  # To locate all regex hits, use stringr::str_locate_all.
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}

rbind fromJSON page: duplicate rowname error

I was trying to rbind some JSON data scraped from an API:
library(jsonlite)

pop_dat <- data.frame()
for (i in 1:3) {
  # Generate the url for each page
  url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=', i)
  # Get the json data from each page and transform it into a dataframe
  dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
  pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Changing row.names to NULL doesn't work. I heard from someone that it is because some of the data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package, WDI, to access this data, and it works well, but I want to know how to resolve the duplicate row names problem here in general, so that I can deal with similar situations where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is that a single bracket can select multiple elements whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"])  # Result is 1, not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]]  # Raises a subscript error.
foo[["a"]]          # This works.
length(foo[["a"]])  # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
  url <- paste0(
    'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
    i
  )
  res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
  final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)  # Select the second element of each list element.
resl <- do.call(rbind, resl)   # Pass every element of the list as an argument to rbind.
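A possible refinement (an editorial sketch; it assumes, based on the snippets above, that the first list element of the response is a metadata record with a pages field): read the page count from the first request instead of hard-coding 10:

meta <- jsonlite::fromJSON(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=1'
)[[1]]
n_pages <- meta$pages  # field name assumed from the API's metadata record

urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  seq_len(n_pages)
)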

For loop - doesn't work properly

I'm trying to use the Indeed API to search for specific jobs, and I ran into a problem where the for loop doesn't go through every iteration.
Here is an example of the code I used:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")

for (i in keywords) {
  url <- paste0(original_url_1, i, original_url_2)
  x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                      as = "text", encoding = "UTF-8")))
  data <- rbind(data, x)
}
The url leads to a JSON file, and adding one of the keywords to the url changes that JSON file, so I'm trying to repeat the request for all keywords and store the results in a dataframe. However, when I use more keywords, I only get results for the first few of them.
library(data.table)

original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")
data <- data.table(NULL)  # initialization of the object

for (i in keywords) {
  url <- paste0(original_url_1, i, original_url_2)
  x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url), as = "text", encoding = "UTF-8")))
  data <- rbind(data, x)
}

> dim(data)
[1] 39 31
Here is the correct code:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <- "&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")
data <- data.frame()

for (i in keywords) {
  tryCatch({
    url <- paste0(original_url_1, i, original_url_2)
    x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                        as = "text", encoding = "UTF-8")))
    data <- rbind(data, x)
  }, error = function(t) {})
}
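One caveat worth adding as an editorial note: error = function(t) {} silently discards every failure, so a keyword that always errors simply vanishes from the results. A sketch of a more transparent handler:

for (i in keywords) {
  tryCatch({
    url <- paste0(original_url_1, i, original_url_2)
    x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                        as = "text", encoding = "UTF-8")))
    data <- rbind(data, x)
  }, error = function(e) {
    # Record which keyword failed and why, instead of failing silently.
    message("keyword '", i, "' failed: ", conditionMessage(e))
  })
}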

Find Jaccard distance of tweets and cluster in Kmeans

This is a follow-up question to a problem I've been working on for a while. I have two questions. One is about an algorithm that works on two tweets, which I revised to measure 10 tweets; I'm wondering what my revision is actually measuring. I get a result, but I want it to measure several tweets' Jaccard distances, not just return one value. Since it returns one value, I think it's just adding everything up. The other question is about my attempt to create a for loop and assign clusters.
I'm trying to find the Jaccard distance between tweets in a dataset, then cluster those tweets with the k-means algorithm.
This is where I'm retrieving the data from:
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json
What I have so far is this:
install.packages("rjson")
library("rjson")
library(jsonlite)

# Download the JSON file and put it into a dataframe
download.file("http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json", tf <- tempfile())
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file(tf)), collapse = ",")))
# Get rid of the geo column
tweet.features <- json_alldata
tweet.features$geo <- NULL

# *Works. Compares two tweets and measures their Jaccard distance
tweetText <- list(tweet1 = tweet.features$text[1]:tweet.features$text[2])
jaccard_i <- function(tw1, tw2) {
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i = i, u = u, j = i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
All of that measures the Jaccard distance of two specified tweets, which is great.
But now I'm trying to modify it to compare the distances between several tweets, this time 10 random tweets which I retrieved with the sample command in R.
# Generates two sets of 5 random tweets
tweetText <- list(sample(tweet.features$text, replace = FALSE, size = 5),
                  sample(tweet.features$text, replace = FALSE, size = 5))
jaccard_i <- function(tw1, tw2) {
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i = i, u = u, j = i/u)
}
jaccard_i(tweetText[[1]], tweetText[[2]])
This gives me results, but it can't be correct.
I'm trying to build an algorithm that can measure all the tweets, compute their pairwise Jaccard distances, and then cluster them based on those distances with k-means.
So, for another attempt, I thought I would make a for loop.
I decided to make 10 cluster centers from 10 random tweets:
c <- sample(tweet.features$text, replace = FALSE, size = 10)
Then I wrote a for loop, hoping to measure the tweets, which I figured I could assign to an array and cluster:
# Algorithm attempt
for (i in tweet.features$text) {
  for (j in c) {
    i <- length(intersect(i, j))
    u <- length(union(i, j))
    j <- i/u
  }
  # assign(my.array)
}
I don't believe that's doing anything useful, but it's an attempt to create a loop to measure the Jaccard distance.
I'm sorry that this is a loaded question. Any help would be appreciated as I'm a bit lost.
In your first function, you call unlist on your lists of words, so tw1 and tw2 end up holding the global lists of words and you cannot use them for a tweet-by-tweet Jaccard. You can accomplish it by removing the unlist; then tw1 and tw2 are lists of vectors of terms, and you can compare them pairwise using mapply. Something as follows.
jaccard_i <- function(tw1, tw2) {
  tw1 <- strsplit(tw1, " |\\.")
  tw2 <- strsplit(tw2, " |\\.")
  i <- mapply(function(tw1, tw2) {
    length(intersect(tw1, tw2))
  }, tw1 = tw1, tw2 = tw2)
  u <- mapply(function(tw1, tw2) {
    length(union(tw1, tw2))
  }, tw1 = tw1, tw2 = tw2)
  list(i = i, u = u, j = i/u)
}
Silly example:
> tw1 <- c("we yes you no", "we are the people")
> tw2 <- c("we are the people", "we yes you no")
> tweetText <- list(tw1, tw2)
> jaccard_i(tweetText[[1]], tweetText[[2]])
$i
[1] 1 1

$u
[1] 7 7

$j
[1] 0.1428571 0.1428571
As for the second part of your question, the double loop, an easy way to start addressing it would be like this:
tw <- c("we yes you no", "we are the people")
lapply(tw, function(tweet1) {
  lapply(tw, function(tweet2) {
    jaccard_i(tweet1, tweet2)
  })
})
With results that look like this,
[[1]]
[[1]][[1]]
[[1]][[1]]$i
[1] 4
[[1]][[1]]$u
[1] 4
[[1]][[1]]$j
[1] 1
[[1]][[2]]
[[1]][[2]]$i
[1] 1
[[1]][[2]]$u
[1] 7
[[1]][[2]]$j
[1] 0.1428571
[[2]]
[[2]][[1]]
[[2]][[1]]$i
[1] 1
[[2]][[1]]$u
[1] 7
[[2]][[1]]$j
[1] 0.1428571
[[2]][[2]]
[[2]][[2]]$i
[1] 4
[[2]][[2]]$u
[1] 4
[[2]][[2]]$j
[1] 1
where you should of course skip the values on the diagonal; just a starting point, as I said.
Hope it helps.
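To connect this back to the clustering goal, an editorial sketch rather than part of the original answer: kmeans() expects feature vectors, not a distance matrix, so a common workaround is to build the full pairwise Jaccard distance matrix and hand it to a medoid-based method such as cluster::pam(), which accepts dissimilarities directly. jaccard_i below is the original two-tweet version from the question:

library(cluster)

tweets <- tweet.features$text  # assumes the dataframe built in the question

# Pairwise Jaccard distance (1 - similarity) between all tweets
n <- length(tweets)
dmat <- matrix(0, n, n)
for (a in seq_len(n)) {
  for (b in seq_len(n)) {
    dmat[a, b] <- 1 - jaccard_i(tweets[a], tweets[b])$j
  }
}

# pam() clusters straight from a dissimilarity object, unlike kmeans()
clusters <- pam(as.dist(dmat), k = 10, diss = TRUE)
clusters$clustering  # cluster assignment for each tweet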