perform unite function using only dplyr commands? - mysql

Is there any way to concatenate two variables together using only dplyr commands?
For example:
library(dplyr)
library(tidyr)  # for unite()

df <- mtcars
df <- select(df, mpg, cyl)
df$mpg <- as.character(df$mpg)
df$cyl <- as.character(df$cyl)
df <- unite(df, "new_var", c(mpg, cyl), sep = "", remove = FALSE)
View(df)
I realize that the unite function makes this very simple, but I can't use it as I'm trying to do this operation on an object of type tbl_MariaDBConnection, which, as far as I understand, can only be operated on with dplyr commands. Thanks!

Using mutate() + paste0() will do the trick:
library(dplyr)

df <- mtcars
df <- select(df, mpg, cyl)
df$mpg <- as.character(df$mpg)
df$cyl <- as.character(df$cyl)
df <- df %>%
  mutate(new_var = paste0(mpg, cyl))
View(df)
For an all-dplyr version, use transmute(), which works like mutate() but drops every column that isn't explicitly created:
df <- mtcars %>%
  dplyr::transmute(
    mpg = as.character(mpg),
    cyl = as.character(cyl),
    new_var = paste0(mpg, cyl)
  )
df
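Since the original object is a lazy tbl_MariaDBConnection table rather than a local data frame, below is a minimal sketch of how the same mutate() + paste0() idea might be applied through dbplyr. The connection object con and the table name "my_table" are assumptions, and whether paste0() translates to the backend's CONCAT depends on your dbplyr version.
library(dplyr)
library(dbplyr)

# con is assumed to be an existing DBI connection, e.g.
# con <- DBI::dbConnect(RMariaDB::MariaDB(), ...)
remote_df <- tbl(con, "my_table")

remote_df <- remote_df %>%
  mutate(new_var = paste0(mpg, cyl))  # translated to SQL by dbplyr, so unite() isn't needed

show_query(remote_df)  # inspect the generated SQL
# collect(remote_df)   # pull the result into R if required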

Related

Scrape page content after option tag is selected

I'd like to scrape the content of a page once the province (and the commune) are selected.
The following code correctly outputs the provinces and their values.
library(rvest)
page <- read_html(x = "https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati/")
text <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_text()
values <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_attr("value")
Res <- data.frame(text = text, values = values, stringsAsFactors = FALSE)
Res
Now, I'd like to access the page for each value, e.g. this might be helpful for getting access to value=19.
text <- page %>% html_nodes(xpath="//*/option[#value = '19']")%>% html_text()
text
The source code is the following
<div class="row results_form_search">
<form role="search" method="POST" class="search-form" action="/progetto-torelli/progetto-torelli-risultati/" id="search_location">
<input type="hidden" name="comune_from" value="" />
<div class="form-row">
<input type="text" name="cognome" placeholder="Cognome" autocomplete="off" value="">
<select name="provincia">
<option value="0" selected>Seleziona Provincia</option>
<option value="74"
>-
</option>
<option value="75"
>AGRIGENTO
</option>
<option value="19"
>ALESSANDRIA
This is where the content that I want to scrape might be.
<div class="row">
<ul class="listing_search">
</ul>
</div>
Thank you so much for your advice!
RSelenium may end up being the way to go. However, if you can insert some judicious waits, or chunk your requests so the server isn't swamped, you can use rvest and make the same requests the page does.
You first need to generate all the combinations of province and comune (filtering out unwanted values); this can be done by making xmlhttp requests, using the value attribute of the options within the province select, to gather back the comune dropdown options and their associated values.
You then make further requests, one for each combination pair, to get the page content you would see when making selections from each of those dropdowns manually and pressing CERCA.
Pauses are needed because there are 10,389 valid combinations, by my reckoning, and, if you attempt to make all those requests one after the other following the initial requests, the server will cut off the connection.
Another option would be to chunk combined up into smaller data frames, make requests for those at timed intervals, and then combine the results.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
  nodes <- read_html(link) %>%
    html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
  df <- data.frame(
    Provincia = nodes %>% html_text(trim = T),
    id0 = nodes %>% html_attr("value")
  )
  return(df)
}
get_comunes <- function(id) {
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
    id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  # print(link)
  nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
  df <- data.frame(
    id0 = id, # id1
    Comune = nodes %>% html_text(trim = T),
    id3 = nodes %>% html_attr("value")
  )
  return(df)
}
get_page <- function(prov_id, com_id) {
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
    prov_id,
    com_id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  page <- read_html(link)
  # print(page %>% html_node(".listing_name") %>% html_text(trim = T))
  # print(tibble(id3 = com_id, page = page))
  return(tibble(id3 = com_id, page = page))
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
combined <- dplyr::right_join(provincias, comunes, by = "id0")
# length(combined$Comune) -> 10389
results <- map2_dfr(combined$id0, combined$id3, .f = get_page)
final <- dplyr::inner_join(combined, results, by = "id3")
Below is a longer version, with the additional info you requested, where I played around with adding pauses. I still found that I could run everything up to, and including,
combined <- dplyr::right_join(provincias, comunes, by = "id0")
in one go. But after that I needed to chunk the requests into batches of about 2,000, with 20-30 minutes in between. You can try tweaking the timings below. I ended up using the commented-out section to run each batch and then left a pause of 30 minutes in between.
Some things to consider:
It seems that you can have comune values like ... which still return listings. With that in mind, you may wish to remove the :not parts of this selector:
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
as I assumed that was filtering out invalid results.
Next, you might consider writing a helper function with httr and RETRY to make the requests with backoff/retry, rather than using fixed pauses. The core of such a function might look like this:
httr::RETRY(
  "GET",
  <request url>,
  times = 3,
  pause_min = 20*60,
  pause_base = 20*60)
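A minimal sketch of what such a helper could look like; the function name get_with_retry and the use of httr::content() to parse the response are my assumptions, not part of the original answer.
library(httr)

# hypothetical helper: retry a GET with backoff, then return the parsed document
get_with_retry <- function(url) {
  resp <- RETRY(
    "GET", url,
    times = 3,
    pause_min = 20*60,
    pause_base = 20*60
  )
  stop_for_status(resp)
  content(resp)  # httr parses HTML responses into an xml_document
}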
Anyway, those are some ideas. Even when I used waits to keep the server from cutting the connection, I still found it started to throttle requests, meaning some requests took quite a long time to complete. Optimizing this could take a lot of time and effort; I spent a good few days playing around with chunk sizes and waits.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
  nodes <- read_html(link) %>%
    html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
  df <- data.frame(
    Provincia = nodes %>% html_text(trim = T),
    id0 = nodes %>% html_attr("value")
  )
  return(df)
}
get_comunes <- function(id) {
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
    id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  # print(link)
  nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
  df <- data.frame(
    id0 = id, # id1
    Comune = nodes %>% html_text(trim = T),
    id3 = nodes %>% html_attr("value")
  )
  return(df)
}
get_data <- function(prov_id, com_id) {
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
    prov_id,
    com_id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  # print(link)
  page <- read_html(link)
  df <- data.frame(
    cognome = page %>% html_nodes(".listing_name") %>% html_text(trim = T),
    livello = page %>% html_nodes(".listing_level") %>% html_text(trim = T),
    id3 = com_id, # for later join back on comune
    id0 = prov_id
  )
  Sys.sleep(.25) # pause for 0.25 sec
  return(df)
}
get_chunks <- function(df, chunk_size) { # adapted from @BenBolker https://stackoverflow.com/a/7060331
  n <- nrow(df)
  r <- rep(1:ceiling(n / chunk_size), each = chunk_size)[1:n]
  d <- split(df, r)
  return(d)
}
write_rows <- function(df, filename) {
  flag <- file.exists(filename)
  df2 <- purrr::map2_dfr(df$id0, df$id3, .f = get_data)
  write.table(df2,
    file = filename, sep = ",",
    append = flag,
    quote = F, col.names = !flag,
    row.names = F
  )
  Sys.sleep(60*10)
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
Sys.sleep(60*5)
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
Sys.sleep(60*10)
combined <- dplyr::right_join(provincias, comunes, by = "id0")
Sys.sleep(60*10)
chunked <- get_chunks(combined, 2000) # https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows
filename <- "prov_com_cog_liv.csv"
map(chunked, ~ write_rows(.x, filename))
## #### test case #####################
# df <- chunked[[6]]
#
# flag <- file.exists(filename)
#
# df2 <- map2_dfr(df$id0, df$id3, .f = get_data)
#
# write.table(df2,
# file = filename, sep = ",",
# append = flag,
# quote = F, col.names = !flag,
# row.names = F
# )
####################################
results <- read.csv(filename)
final <- dplyr::right_join(combined, results, by = "id3")

Scraping table with multiple headers in R using any package? (XML, rCurl, rlist htmltab, rvest etc)

I am attempting to scrape this table
http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1
Here are all my attempts. None of them get even close to extracting any information. Am i missing something?
library("rvest")
library("tidyverse")
# METHOD 1
url <- "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1"
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="t1"]/tbody/tr[1]') %>%
  html_table()
data <- data[[1]]
# METHOD 2
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# METHOD 3
library(htmltab)
tab <- htmltab("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1",
               which = '//*[@id="t1"]/tbody/tr[4]',
               header = '//*[@id="t1"]/tbody/tr[3]',
               rm_nodata_cols = TRUE)
# METHOD 4
website <- read_html("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1")
scraped <- website %>%
  html_nodes("table") %>%
  .[(2)] %>%
  html_table(fill = TRUE) %>%
  `[[`(1)
# METHOD 5
getHrefs <- function(node, encoding) {
  if (!is.null(xmlChildren(node)$a)) {
    paste(xpathSApply(node, './a', xmlGetAttr, "href"), collapse = ",")
  } else {
    return(xmlValue(xmlChildren(node)$text))
  }
}
data <- (readHTMLTable("http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=1999&m=1", which = 1, elFun = getHrefs))
The expected result should be the 12 column names in the table and the data below them.

R: Vector of JSONs to data.frame

I have a vector of JSONs (of the same structure) and want to transform it into a data.frame. The following example does exactly what I want.
require(jsonlite) # fromJSON()
require(magrittr) # for the pipeline only
require(data.table) # rbindlist()
jsons <- c('{"num":1,"char":"a","list":{"x":1,"y":2}}',
           '{"num":2,"char":"b","list":{"x":1,"y":2}}',
           '{"num":3,"char":"c","list":{"x":1,"y":2}}')

df <- jsons %>%
  lapply(fromJSON) %>%
  lapply(as.data.frame.list, stringsAsFactors = F) %>%
  rbindlist(fill = T)
Some elements of the JSON are objects, i.e. if I transform it with fromJSON() some elements of the list will be lists as well. I cannot use unlist() on each list because I have different variable types, so I am using the as.data.frame.list() function. This is, however, too slow to do for each JSON individually. Is there a way to do it more efficiently?
json <- '{"$schema":"http://json-schema.org/draft-04/schema#","title":"Product set","type":"array","items":{"title":"Product","type":"object","properties":{"id":{"description":"The unique identifier for a product","type":"number"},"name":{"type":"string"},"price":{"type":"number","minimum":0,"exclusiveMinimum":true},"tags":{"type":"array","items":{"type":"string"},"minItems":1,"uniqueItems":true},"dimensions":{"type":"object","properties":{"length":{"type":"number"},"width":{"type":"number"},"height":{"type":"number"}},"required":["length","width","height"]},"warehouseLocation":{"description":"Coordinates of the warehouse with the product","$ref":"http://json-schema.org/geo"}},"required":["id","name","price"]}}'
system.time(
  df <- json %>% rep(1000) %>%
    lapply(fromJSON) %>%
    lapply(as.data.frame.list, stringsAsFactors = F) %>%
    rbindlist(fill = T)
) # 2.72
I know that there are plenty of similar questions, but most of the answers I saw were about using as.data.frame() or data.frame(). Nobody mentioned speed. Maybe there is no better solution to this.
I finally found the answer. It will be on CRAN soon.
devtools::install_github("jeremystan/tidyjson")
tidyjson::spread_all()
This function is about ten times faster than my example above.
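For reference, a minimal sketch of how spread_all() could be applied to the example vector from the question; the exact output columns (e.g. list.x and list.y for the nested object) depend on the tidyjson version.
library(dplyr)
library(tidyjson)

jsons <- c('{"num":1,"char":"a","list":{"x":1,"y":2}}',
           '{"num":2,"char":"b","list":{"x":1,"y":2}}',
           '{"num":3,"char":"c","list":{"x":1,"y":2}}')

# spread_all() spreads every scalar value into its own column and
# flattens nested objects with a "." separator (e.g. list.x, list.y)
df <- jsons %>% spread_all()
df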
Try collapsing all of the JSONs into one string. Here is an example of the solution:
require(jsonlite)
require(data.table)
json <- '{"$schema":"http://json-schema.org/draft-04/schema#","title":"Product set","type":"array","items":{"title":"Product","type":"object","properties":{"id":{"description":"The unique identifier for a product","type":"number"},"name":{"type":"string"},"price":{"type":"number","minimum":0,"exclusiveMinimum":true},"tags":{"type":"array","items":{"type":"string"},"minItems":1,"uniqueItems":true},"dimensions":{"type":"object","properties":{"length":{"type":"number"},"width":{"type":"number"},"height":{"type":"number"}},"required":["length","width","height"]},"warehouseLocation":{"description":"Coordinates of the warehouse with the product","$ref":"http://json-schema.org/geo"}},"required":["id","name","price"]}}'
n <- 1000
ex <- rep(json, n)
f1 <- function(x) {
  res <- lapply(x, fromJSON)
  res <- lapply(res, as.data.frame.list, stringsAsFactors = FALSE)
  res <- rbindlist(res, fill = TRUE)
  return(res)
}

f2 <- function(x) {
  res <- fromJSON(paste0("[", paste(x, collapse = ","), "]"), flatten = TRUE)
  lst <- sapply(res, is.list)
  res[lst] <- lapply(res[lst], function(x) as.data.table(transpose(x)))
  res <- flatten(res)
  return(res)
}
bench::mark(
  f1(ex), f2(ex), min_iterations = 100, check = FALSE
)
#> # A tibble: 2 x 14
#>   expression      min    mean  median      max `itr/sec` mem_alloc  n_gc n_itr total_time result memory time
#>   <chr>       <bch:t> <bch:t> <bch:t> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list> <lis>
#> 1 f1(ex)        2.27s   2.35s   2.32s    2.49s     0.425        0B  5397   100      3.92m <data… <Rpro… <bch…
#> 2 f2(ex)      48.85ms 63.78ms 57.88ms 116.19ms    15.7          0B   143   100      6.38s <data… <Rpro… <bch…
#> # … with 1 more variable: gc <list>

Replacing NA columns with Median in R

I keep getting errors with the code. Which version would be correct?
The first one is correct, but generally you don't want to use for loops in R. Try using apply.
Answer
v <- runif(10000)
v[sample(1000,120)] <- NA
mydata <- as.data.frame(matrix(v,10,100))
for (i in 1:ncol(mydata)) {
  mydata[is.na(mydata[, i]), i] <- median(mydata[, i], na.rm = TRUE)
}
Apply Solution
v <- runif(10000)
v[sample(1000,120)] <- NA
mydata <- as.data.frame(matrix(v,10,100))
medrep <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
dat <- data.frame(apply(mydata, 2, medrep))
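For completeness, a hedged sketch of the same per-column replacement written with dplyr::across(); this is my addition, not part of the original answer, and requires dplyr >= 1.0.0.
library(dplyr)

v <- runif(10000)
v[sample(1000, 120)] <- NA
mydata <- as.data.frame(matrix(v, 10, 100))

# replace the NAs in every column with that column's median
dat <- mydata %>%
  mutate(across(everything(), ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))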

how to use tidyjson inside dplyr

I have a data frame, called data_df, which has one column containing JSON strings; the column name is json_response.
I want to access a very specific key-value pair from it. An example of one of the JSON strings follows. I want to know how many times success is true in the string.
x = "[{\"s\":\"D\",\"success\":true,\"start.time\":\"2016-01-27 19:27:27\",\"stop.time\":\"2016-01-27 19:27:30\",\"status_code\":200,\"called\":true,\"milliseconds\":3738.6858,\"_row\":\"DataX\"},{\"s\":\"C\",\"success\":true,\"start.time\":\"2016-01-27 19:27:30\",\"stop.time\":\"2016-01-27 19:27:32\",\"status_code\":200,\"called\":true,\"milliseconds\":1815.1433,\"_row\":\"Clarity\"}]"
If I only want to use tidyjson, I can do it as follows, which works as I want.
library(dplyr)
library(tidyjson)
x %>% gather_array %>%
  spread_values(called = jstring("called")) %>%
  summarize(x = sum(called == "TRUE"))
Now, if I want to do it for the whole column, how should I do it? I don't want to use a loop.
The following is the code I tried:
data_df %>%
  transmute(
    test = json_response %>% gather_array %>%
      spread_values(called = jstring("called")) %>%
      summarize(x = sum(called == "TRUE"))
  )
The following is the error I got when I ran the above code:
Error: not compatible with STRSXP
Instead of using tidyjson, you can use rjson combined with dplyr like this:
library(dplyr)
library(rjson)

data_df$test <- data_df %>% rowwise %>%
  do(test = .$json_response %>% as.character %>% fromJSON %>% sapply(`[[`, "called") %>% sum) %>%
  as.data.frame
You can use tidyjson for this: simply convert data_df into a tbl_json object and then proceed as before:
data_df %>%
  as.tbl_json(json.column = "json_response") %>%
  # track each document if you don't already have an ID
  mutate(rownum = 1:n()) %>%
  gather_array %>%
  # use jlogical for the correct type
  spread_values(success = jlogical("success")) %>%
  group_by(rownum) %>%
  summarize(num.successes = sum(success))
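As a usage illustration (my construction, not from the original answer), running this pipeline on a small data_df built from the question's example string x should report two successes per row, since each JSON array holds two objects with success = true.
library(dplyr)
library(tidyjson)

# toy data_df with two identical rows built from the question's example string x
data_df <- data.frame(json_response = c(x, x), stringsAsFactors = FALSE)

data_df %>%
  as.tbl_json(json.column = "json_response") %>%
  mutate(rownum = 1:n()) %>%
  gather_array %>%
  spread_values(success = jlogical("success")) %>%
  group_by(rownum) %>%
  summarize(num.successes = sum(success))
# expected: num.successes == 2 for both rows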