For loop - doesn't work properly - json

I'm trying to use the Indeed API to search for specific jobs, and I ran into a problem where the for loop doesn't go through every iteration.
Here is an example of the code that I used:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <-"&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")
for(i in keywords) {
url <- paste0(original_url_1,i,original_url_2)
x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
as = "text", encoding = "UTF-8")))
data <- rbind(data, x)
}
The URL returns a JSON file, and inserting one of the keywords into the URL changes which JSON is returned. So I'm trying to repeat this for every keyword and store the results in a data frame. However, when I use more keywords, I only get results for the first few of them.

original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <-"&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data_scientist", "data+analyst")
data<-data.table(NULL)#initialization of object
for(i in keywords) {
url <- paste0(Original_url_1,i,Original_url_2)
x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),as = "text", encoding = "UTF-8")))
data <- rbind(data, x)
}
>dim(data)
[1] 39 31

Here is the correct code:
original_url_1 <- "http://api.indeed.com/ads/apisearch?publisher=750330686195873&format=json&q="
original_url_2 <-"&l=Canada&sort=date&radius=10&st=&jt=&start=0&limit=25&fromage=3&filter=&latlong=1&co=ca&chnl=&userip=69.46.99.196&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"
keywords <- c("data+scientist", "data+analyst")
data <- data.frame()
for (i in keywords) {
tryCatch({url <- paste0(original_url_1,i,original_url_2)
x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
as = "text", encoding = "UTF-8")))
data <- rbind(data, x)
}, error = function(t){})
}
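One caveat with this fix: the empty error handler hides which keywords failed. A minimal variation (the message() call is an illustrative addition, not part of the original post) reports the failing keyword instead of silently skipping it:
data <- data.frame()
for (i in keywords) {
  tryCatch({
    url <- paste0(original_url_1, i, original_url_2)
    x <- as.data.frame(jsonlite::fromJSON(httr::content(httr::GET(url),
                                                        as = "text", encoding = "UTF-8")))
    data <- rbind(data, x)
  }, error = function(e) {
    # report which keyword failed and why, then move on to the next one
    message("Skipping keyword '", i, "': ", conditionMessage(e))
  })
}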

Related

Shiny - assigning argument to function

I am trying to create a shiny app that applies a self-made function to an uploaded dataset and then lets the user download the modified results. Here is my code:
library(shiny)
library(tidyverse)

namkurz <- function(data, a_spalte) {
  kuerzel <- vector(length = length(data$a_spalte))
  for (i in 1:length(data$a_spalte)) {
    spez = data$Art[i]
    s = unlist(strsplit(spez, " ", fixed = TRUE))
    s = substr(s, 1, 2)
    s = paste(s, collapse = ' ')
    kuerzel[[i]] = s
  }
  data <- data %>%
    mutate(kurz = kuerzel)
}
ui <- fluidPage(
  fileInput('upload', 'Deine Kartierungsdaten'),
  textInput('art', 'Wie heißt die Spalte mit Artnamen?'),
  downloadButton('analyse', 'Artenkürzel hinzufügen')
)

server <- function(input, output, session) {
  data <- reactive({
    req(input$upload)
    ext <- tools::file_ext(input$upload$name)
    switch(ext,
           csv = vroom::vroom(input$upload$datapath, delim = ";"),
           validate("Invalid file; Please upload a .csv file")
    )
  })
  art <- reactive(input$art)
  output$analyse <- downloadHandler(
    filename = function() {
      paste0('mit_kuerzel', ".csv")
    },
    content = function(file) {
      ergebnis <- reactive(namkurz(data(), art()))
      vroom::vroom_write(ergebnis(), file)
    }
  )
}

shinyApp(ui, server)
When trying to save the output I get a 'Warning: Unknown or uninitialised column:' message. I think the problem is in how the argument 'art' is passed on to the 'ergebnis' object, but I can't find a way to fix it.
I recommend a few things:
(Required) In your function, a_spalte is a character vector and not the literal name of a column in the frame, so you need to use [[ instead of $; see The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe and Dynamically select data frame columns using $ and a character value.
Change all references of data$a_spalte to data[[a_spalte]].
namkurz <- function(data, a_spalte) {
  kuerzel <- vector(length = length(data[[a_spalte]]))
  for (i in 1:length(data[[a_spalte]])) {
    spez = data$Art[i]
    s = unlist(strsplit(spez, " ", fixed = TRUE))
    s = substr(s, 1, 2)
    s = paste(s, collapse = ' ')
    kuerzel[[i]] = s
  }
  data <- data %>%
    mutate(kurz = kuerzel)
}
Your function is a bit inefficient doing things row-wise; we can vectorize that operation.
namkurz <- function(data, a_spalte) {
  spez <- strsplit(data$Art, " ", fixed = TRUE)
  data$kurz <- sapply(spez, function(z) paste(substr(z, 1, 2), collapse = " "))
  data
}
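For example, on a small made-up frame (the species names are only illustrative):
df <- data.frame(Art = c("Parus major", "Turdus merula"))
namkurz(df, "Art")
# returns df with a new kurz column: "Pa ma", "Tu me"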
(Optional) The content= portion of downloadHandler is already reactive, you do not need to wrap namkurz in reactive. Because of this, you also don't need to treat ergebnis as reactive.
output$analyse <- downloadHandler(
  filename = ...,
  content = function(file) {
    ergebnis <- namkurz(data(), art())
    vroom::vroom_write(ergebnis, file)
  }
)
(Optional) Your output filename is fixed, so two things here: if it's always going to be "mit_kuerzel.csv", then there's no need for paste0, just use function() "mit_kuerzel.csv".
However, if you are intending to return a file named something based on the original input filename, one could do something like:
filename = function() {
  paste0(tools::file_path_sans_ext(basename(input$upload$name)),
         "_mit_kuerzel.",
         tools::file_ext(input$upload$name))
},
to add _mit_kuerzel to the base portion of the uploaded filename. Note that the file in the content= section never actually has this name; the new *_mit_kuerzel.csv name is only offered to the downloading browser as a suggestion, that is all.
(Optional) You are using a .csv file extension in the downloadHandler, but the default for vroom::vroom_write is to use delim = "\t", which is not a CSV. I suggest either adding delim = ";" (or similar), or changing the returned filename extension to .tsv instead.
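For example, a sketch of the first option, reusing the semicolon delimiter that the upload code already assumes:
content = function(file) {
  ergebnis <- namkurz(data(), art())
  # write a real semicolon-separated file so it matches the .csv extension
  vroom::vroom_write(ergebnis, file, delim = ";")
}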

R - Issue with the DOM of the Danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all the legislative documents to its website. I've been crawling over all pages starting in 2008. Right now I'm parsing the information into a dataframe, and I'm having an issue that I haven't been able to resolve so far.
If we look at the DOM, we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my pattern matches more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that
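For the narrower "only keep the first match" part, one simple option is to index the grep result; a minimal illustration (the x vector here is made up):
x <- c("Vedtaget regeringsforslag", "something else", "Vedtaget privat forslag")
hits <- grep(paste(tomatch, collapse = "|"), x, value = TRUE)
first_hit <- hits[1]  # only the first match; NA if there was none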
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" nodes are related to the text. I was able to get the text of the webpage with the following code, which also identifies the position of the first "regex hit" for each of the patterns to match.
library(pagedown)
library(pdftools)
library(stringr)

pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"

IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)

doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}

IMF data downloading bug

I'm facing a bug which is starting to make me really nervous.
To begin from the start: I wrote code which downloads data from IMF DOT (countries' export and import data).
Sometimes this code works and downloads all the data. Other times, in the middle of downloading, I get this error:
No encoding supplied: defaulting to UTF-8.
Error: lexical error: invalid char in json text.
<!DOCTYPE HTML PUBLIC "-//W3C//
(right here) ------^
The funny thing is that sometimes the error happens at the beginning of the download, sometimes in the middle. Basically, it seems random, so debugging it is like fighting a shadow. Has anyone else faced this problem and can help?
Code:
rm(list = ls())
# Code downloads data from DOT (IMF).
# DOT(date).csv

# Libraries
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(IMFData))
suppressPackageStartupMessages(library(TTR))
suppressPackageStartupMessages(library(readxl))

# Parameters of download -------------------------------------------------------
databaseID <- 'DOT'
startdate <- '1977-01-01'
enddate <- format(Sys.Date(), "%Y-%m-%d")
checkquery = FALSE

# Frequency
download.freq <- c("A")

# Area
available.codes <- DataStructureMethod('DOT')
cn <- available.codes$CL_AREA_DOT

# Download data -----------------------------------------------------------
print("Downloading")
datalist <- list(); queryfilter <- list()
for (i in 1:length(cn[, "CodeValue"])) {
  queryfilter[[i]] <- list(CL_FREA = download.freq,
                           CL_AREA_DOT = cn[, "CodeValue"][i],
                           CL_INDICATOR_DOT = "TXG_FOB_USD")
}

datalist <- plyr::llply(queryfilter, function(x) {
  Sys.sleep(runif(1, 2, 5))
  Dot.downloader(databaseID, x, startdate, enddate)
}, .progress = "text")

# This is where the error happens
data <- do.call(rbind.data.frame, datalist)
Oh, and the Dot.downloader function looks like this (it is from the IMFData package, just slightly adapted to the situation):
Dot.downloader <- function(databaseID, queryfilter = NULL,
                           startdate = '1977-01-01', enddate = '2016-12-31') {
  queryfilterstr <- ''
  if (length(queryfilter) > 0) {
    queryfilterstr <- paste0(
      unlist(plyr::llply(queryfilter,
                         function(x) (paste0(x, collapse = "+")))), collapse = ".")
  }
  APIstr <- paste0('http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/',
                   databaseID, '/', queryfilterstr,
                   '?startPeriod=', startdate, '&endPeriod=', enddate)
  r <- httr::GET(APIstr)
  if (httr::http_status(r)$reason != "OK") {
    stop(paste(unlist(httr::http_status(r))))
    return(list())
  }
  r.parsed <- jsonlite::fromJSON(httr::content(r, "text"))
  if (is.null(r.parsed$CompactData$DataSet$Series)) {
    warning("No data available")
    return(NULL)
  }
  if (class(r.parsed$CompactData$DataSet$Series) == "data.frame") {
    r.parsed$CompactData$DataSet$Series <- r.parsed$CompactData$DataSet$Series[!plyr::laply(r.parsed$CompactData$DataSet$Series$Obs, is.null), ]
    if (nrow(r.parsed$CompactData$DataSet$Series) == 0) {
      warning("No data available")
      return(NULL)
    }
  }
  if (class(r.parsed$CompactData$DataSet$Series) == "list") {
    if (is.null(r.parsed$CompactData$DataSet$Series$Obs)) {
      warning("No data available")
      return(NULL)
    }
    ret.df <- as.data.frame(r.parsed$CompactData$DataSet$Series[1:(length(r.parsed$CompactData$DataSet$Series) - 1)])
    ret.df$Obs <- list(r.parsed$CompactData$DataSet$Series$Obs)
    names(ret.df) <- names(r.parsed$CompactData$DataSet$Series)
    r.parsed$CompactData$DataSet$Series <- ret.df
  }
  return(r.parsed$CompactData$DataSet$Series)
}
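One possible direction (a sketch only, not a verified fix): wrap each request in a retry, on the assumption that the intermittent HTML-instead-of-JSON responses are transient (for example throttling by the API). The download_with_retry helper is hypothetical and reuses databaseID, startdate and enddate from the script above:
download_with_retry <- function(queryfilter, tries = 3) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(
      Dot.downloader(databaseID, queryfilter, startdate, enddate),
      error = function(e) e
    )
    # return as soon as one attempt succeeds
    if (!inherits(result, "error")) return(result)
    Sys.sleep(10 * attempt)  # back off before retrying
  }
  warning("Giving up on query: ", paste(unlist(queryfilter), collapse = "."))
  NULL
}

datalist <- plyr::llply(queryfilter, download_with_retry, .progress = "text")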

Loading json encoded log files into R for analysis

I have a log file with each line a json-encoded entry:
{"requestId":"5550d","partnerId":false,"ip":"170.158.3.1", ... }
I tried reading each line, decoding the JSON, and then appending to the data frame:
library(jsonlite)

loadLogs <- function(fileName) {
  conn <- file(fileName, "r", blocking = FALSE)
  linn <- readLines(conn)
  long <- length(linn)
  df <- data.frame(requestId = character(0),
                   partnerId = character(0),
                   ip = character(0))
  for (i in 1:long) {
    jsonRow <- fromJSON(linn[i])
    df <- rbind(df, data.frame(requestId = jsonRow$requestId,
                               partnerId = as.character(jsonRow$partnerId),
                               ip = jsonRow$ip))
  }
  close(conn)
  return(df)
}
The above code is extremely slow for large files though. Is there any way to speed this up? A few options I can think of at the moment:
1. pre-allocate the data frame, since rbind copies the entire data for every append
2. use an apply function for the JSON decoding step
3. ???
How would I do (1) and (2) in R? I'm new to the language.
Thanks for looking at my question.
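A sketch of both ideas, plus jsonlite's built-in reader for newline-delimited JSON, assuming each log line is a complete JSON object with at least the fields from the example (requestId, partnerId, ip):
library(jsonlite)

loadLogs <- function(fileName) {
  linn <- readLines(fileName)
  # (2) parse every line first ...
  rows <- lapply(linn, function(line) {
    jsonRow <- fromJSON(line)
    data.frame(requestId = jsonRow$requestId,
               partnerId = as.character(jsonRow$partnerId),
               ip = jsonRow$ip,
               stringsAsFactors = FALSE)
  })
  # ... (1) then build the result in a single step instead of growing it row by row
  do.call(rbind, rows)
}

# Alternatively, jsonlite can read newline-delimited JSON directly:
# df <- jsonlite::stream_in(file(fileName))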

Trying to parse IMDb but the links are different each time I open the site

I'm trying to get links to all pages with popular feature films on IMDb. There is no problem with the first 2000 pages, since their URLs all have exactly the same form, for example:
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=99951&title_type=feature
Each page contains 50 links to movies, so in these URLs the start parameter says that the page lists movies from start to start + 50.
The problem is with the pages that follow the one with start=99951. At the end of each URL there is an extra part like &tok=0f97, for example:
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=100051&title_type=feature&tok=13c9
So when I try to parse such a page to get the links for all 50 movies (I use R for this), I get nothing.
Here is the code I use to parse the pages; it works for the first 2000 pages:
library(XML)      # getHTMLLinks
library(stringi)  # stri_replace_all_regex, stri_detect_regex

makeListOfUrls <- function() {
  howManyPages <- round(318485 / 50)
  urlStart <- "http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)
    ## I get an empty character vector for page 2001 and onwards !!!
    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]
  # there are no links to movies for the pages after 2000 (titleLinks is empty)
  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}

makeLongPath <- function(links) {
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList)
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      cV <- c(tmpV, UrlList[[i]])
      tmpV <- cV
    }
    return(tmpV)
  }
}
So, is there any way to access these sites?
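One possible direction (a sketch only, not a verified solution): instead of constructing every URL up front, follow each results page's own "Next" link, so that whatever &tok=... value IMDb appends is carried along automatically. The getNextUrl and crawl helpers below are hypothetical, and the assumption that every results page exposes an anchor whose text contains "Next" with a relative href is unverified:
library(rvest)
library(stringi)

getNextUrl <- function(page) {
  links <- html_nodes(page, "a")
  hrefs <- html_attr(links, "href")
  labels <- html_text(links, trim = TRUE)
  nxt <- hrefs[stri_detect_fixed(labels, "Next")][1]  # first "Next" anchor, if any
  if (is.na(nxt)) return(NA_character_)
  paste0("http://www.imdb.com", nxt)                  # assumes a relative href
}

crawl <- function(startUrl, maxPages) {
  url <- startUrl
  out <- list()
  for (i in seq_len(maxPages)) {
    page <- read_html(url)
    hrefs <- html_attr(html_nodes(page, "a"), "href")
    hrefs <- hrefs[!is.na(hrefs)]
    out[[i]] <- unique(hrefs[stri_detect_regex(hrefs, "^/title/tt[0-9]+/$")])
    url <- getNextUrl(page)
    if (is.na(url)) break                             # no further page found
  }
  unlist(out)
}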