Is there a robust scaler method in the recipes package in R? From my research, I could not find such a method.
I'm assuming you are referring to the RobustScaler from scikit-learn. You are correct that there isn't a similar step in the recipes package.
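For context, scikit-learn's RobustScaler centers each column on its median and scales it by the spread between two quantiles (the IQR by default), so the transformation itself is easy to write by hand. A minimal base-R sketch of that computation (robust_scale is just an illustrative helper, not a recipes step):
# Center each column on its median, scale by the spread between two quantiles
robust_scale <- function(x, probs = c(0.25, 0.75)) {
  qs <- quantile(x, probs = probs, na.rm = TRUE)
  (x - median(x, na.rm = TRUE)) / (qs[2] - qs[1])
}

scaled <- as.data.frame(lapply(mtcars, robust_scale))
head(scaled$mpg)  # first value is roughly 0.244, in line with the step_robust() output below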
A ready-made step is implemented in the extrasteps package, which you can install with:
# install.packages("devtools")
devtools::install_github("EmilHvitfeldt/extrasteps")
Then you can use step_robust(), which does what you are expecting:
library(recipes)
library(extrasteps)
rec <- recipe(~., data = mtcars) %>%
  step_robust(all_predictors()) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.244 0 -0.177 -0.156 0.244 -0.685 -0.623 0 1 0 1
#> 2 0.244 0 -0.177 -0.156 0.244 -0.437 -0.344 0 1 0 1
#> 3 0.488 -0.5 -0.430 -0.359 0.185 -0.977 0.448 1 1 0 -0.5
#> 4 0.298 0 0.301 -0.156 -0.732 -0.107 0.862 1 0 -1 -0.5
#> 5 -0.0678 0.5 0.798 0.623 -0.649 0.112 -0.344 0 0 -1 0
#> 6 -0.149 0 0.140 -0.216 -1.11 0.131 1.25 1 0 -1 -0.5
#> 7 -0.664 0.5 0.798 1.46 -0.577 0.238 -0.932 0 0 -1 1
#> 8 0.705 -0.5 -0.242 -0.731 -0.00595 -0.131 1.14 1 0 0 0
#> 9 0.488 -0.5 -0.271 -0.335 0.268 -0.170 2.59 1 0 0 0
#> 10 0 0 -0.140 0 0.268 0.112 0.294 1 0 0 1
#> # … with 22 more rows
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 15.4 robust_hS9q6
#> 2 mpg median 19.2 robust_hS9q6
#> 3 mpg higher 22.8 robust_hS9q6
#> 4 cyl lower 4 robust_hS9q6
#> 5 cyl median 6 robust_hS9q6
#> 6 cyl higher 8 robust_hS9q6
#> 7 disp lower 121. robust_hS9q6
#> 8 disp median 196. robust_hS9q6
#> 9 disp higher 326 robust_hS9q6
#> 10 hp lower 96.5 robust_hS9q6
#> # … with 23 more rows
rec <- recipe(~., data = mtcars) %>%
  step_robust(all_predictors(), range = c(0.1, 0.9)) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.114 0 -0.115 -0.0732 0.171 -0.337 -0.281 0 1 0
#> 2 0.114 0 -0.115 -0.0732 0.171 -0.215 -0.155 0 1 0
#> 3 0.229 -0.5 -0.280 -0.169 0.129 -0.480 0.202 1 1 0
#> 4 0.140 0 0.196 -0.0732 -0.512 -0.0526 0.388 1 0 -0.5
#> 5 -0.0317 0.5 0.519 0.293 -0.453 0.0550 -0.155 0 0 -0.5
#> 6 -0.0698 0 0.0910 -0.101 -0.778 0.0645 0.563 1 0 -0.5
#> 7 -0.311 0.5 0.519 0.687 -0.403 0.117 -0.420 0 0 -0.5
#> 8 0.330 -0.5 -0.157 -0.344 -0.00416 -0.0645 0.514 1 0 0
#> 9 0.229 -0.5 -0.176 -0.158 0.187 -0.0837 1.16 1 0 0
#> 10 0 0 -0.0910 0 0.187 0.0550 0.132 1 0 0
#> # … with 22 more rows, and 1 more variable: carb <dbl>
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 14.3 robust_MygTA
#> 2 mpg median 19.2 robust_MygTA
#> 3 mpg higher 30.1 robust_MygTA
#> 4 cyl lower 4 robust_MygTA
#> 5 cyl median 6 robust_MygTA
#> 6 cyl higher 8 robust_MygTA
#> 7 disp lower 80.6 robust_MygTA
#> 8 disp median 196. robust_MygTA
#> 9 disp higher 396 robust_MygTA
#> 10 hp lower 66 robust_MygTA
#> # … with 23 more rows
I'm using an AWS MariaDB instance to store some data. My idea was to do the full management with the DBI package. However, I have found that DBI only writes the first row of the data when I try to write a table to the database, so I have to fall back on DBI::dbCreateTable plus dbx::dbxInsert. I can't figure out why DBI is not writing the full data frame.
I have gone through this post but the conclusion is not quite clear. This is the code/output:
con <- DBI::dbConnect(odbc::odbc(), "my_odbc", timeout = 10)
## Example 1 - doesn't work
DBI::dbWriteTable(con, "test1", mtcars)
DBI::dbReadTable(con, "test1")
row_names mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 2 - doesn't work
DBI::dbCreateTable(con, "test2", mtcars)
DBI::dbAppendTable(con, "test2", mtcars)
[1] 1
DBI::dbReadTable(con, "test2")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 3 - does work.
DBI::dbCreateTable(con, "test3", mtcars)
dbx::dbxInsert(con, "test3", mtcars)
DBI::dbReadTable(con, "test3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I had a similar issue: if you aren't careful with how you define and use your primary keys, you get exactly this behaviour. The first row is accepted because it's the first with that primary key, and the rows after it are blocked as duplicates and hence don't get inserted.
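If that's what is going on, it will show up in the table definition, and recreating the table with a key that is unique for every row should let the append go through. A rough sketch against a MariaDB connection (the test4 name and the column types are just for illustration):
# Look at how the problem table was actually defined (keys, constraints)
DBI::dbGetQuery(con, "SHOW CREATE TABLE test2")

# Recreate it with a surrogate key that is unique for every row,
# then append the full data frame
DBI::dbExecute(con, "DROP TABLE IF EXISTS test4")
DBI::dbExecute(con, "
  CREATE TABLE test4 (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    mpg  DOUBLE, cyl DOUBLE, disp DOUBLE, hp   DOUBLE, drat DOUBLE, wt   DOUBLE,
    qsec DOUBLE, vs  DOUBLE, am   DOUBLE, gear DOUBLE, carb DOUBLE
  )")
DBI::dbAppendTable(con, "test4", mtcars)
nrow(DBI::dbReadTable(con, "test4"))  # expect 32 rows if the key was the problem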
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181", .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the url changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my algorithm so that it can obtain the table across different years, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the html source code, but no luck.
With rvest, you can set the value in the form and submit it. Wrap it in purrr::map_dfr to iterate over the years and row-bind the results into a data frame:
library(rvest)

sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")

baseball <- purrr::map_dfr(
  2017:2015,
  function(y) {
    Sys.sleep(10 + runif(1))  # be polite to the server
    form <- sess %>%
      html_node(xpath = '//form[@action="index.php"]') %>%
      html_form() %>%
      set_values(year = y)
    sess <- submit_form(sess, form)
    sess %>%
      read_html() %>%
      html_node('#TTdata') %>%
      html_table(header = TRUE)
  }
)

tibble::as_data_frame(baseball)  # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>
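Note that this answer uses the pre-1.0 rvest API (html_session(), set_values(), submit_form()). If you're on rvest 1.0 or later, those functions have been renamed, so the same steps would look roughly like this for a single year (same logic, only the function names change):
library(rvest)

sess <- session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")

form <- sess %>%
  html_element(xpath = '//form[@action="index.php"]') %>%
  html_form() %>%
  html_form_set(year = 2016)

sess <- session_submit(sess, form)

sess %>%
  html_element('#TTdata') %>%
  html_table(header = TRUE)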
I'm trying to use the information contained in keyed JSON names to add context to the data contained in their nested matrices. The matrices have different numbers of rows, and some of the matrices are missing (list element NULL). I am able to extract the relevant data and retain information as list names from the hierarchy using map and at_depth from the purrr package, but I cannot find a clean way to get this into a single data.frame.
I have attempted to use purrr::transpose as exemplified here, and I've tried using tidyr::unnest as shown here, but I think their desired results and inputs differ enough from mine that they are not applicable. There seem to be too many problems with the differing row names and/or the missing matrices. I am also new to the purrr package, so there could be something simple that I'm missing here.
Here is my own attempt, which produces nearly the desired result. I think I could modify it a bit more to remove the for loop in favour of another layer of 'apply'-style functions, but I suspect there are better ways to go about this.
Minimal reproducible example
library(RCurl)    # getURL
library(jsonlite) # fromJSON
library(purrr)    # map, at_depth

# Download data
json <- getURL("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi?type=lake_survey&id=69070100")

# Surveys are the relevant data
data.listed <- fromJSON(json, simplifyDataFrame = FALSE)
surveys <- data.listed$result$surveys

# Get list of lists of matrices - fish size count data
fcounts <- map(surveys, "lengths") %>%
  at_depth(2, "fishCount") %>%
  at_depth(2, data.frame)  # side note: is this a good way to convert the inner matrices to data.frames?

# top level - list       - surveys
# 2nd level - list       - species in each survey
# 3rd level - data.frame - X1: measured_size, X2: counts

# Use survey IDs as names for the top-level list,
# just as species are used as names for the 2nd-level lists
names(fcounts) <- sapply(surveys, function(s) s$surveyID)

# This produces nearly the correct result
for (i in seq_along(fcounts)) {
  surv.id <- names(fcounts)[[i]]
  if (length(fcounts[[i]]) > 0) {
    listed.withSpecies <- lapply(names(fcounts[[i]]),
                                 function(species) cbind(fcounts[[i]][[species]], species))
    surv.fishCounts <- do.call(rbind, listed.withSpecies)
    colnames(surv.fishCounts) <- c("size", "count", "species")
    surv.fishCounts$survey.ID <- surv.id
    print(surv.fishCounts)
  }
}
This is one way to get the nested length-count data frames into one big data frame:
library(httr)
library(tidyverse)
res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
           query = list(type = "lake_survey", id = "69070100"))

content(res, as = "text") %>%
  jsonlite::fromJSON(simplifyDataFrame = FALSE, flatten = FALSE) -> x

x$result$surveys %>%
  map_df(~{
    tmp_df <- flatten_df(.x[c("surveyDate", "surveyID", "surveyType", "surveySubType")])
    lens <- .x$lengths
    if (length(lens) > 0) {
      fish <- names(lens)
      data_frame(fish,
                 max_length = map_dbl(lens, "maximum_length"),
                 min_length = map_dbl(lens, "minimum_length"),
                 lens = map(lens, "fishCount") %>%
                   map(~set_names(as_data_frame(.), c("catch_len", "ct")))) %>%
        mutate(surveyDate = tmp_df$surveyDate,
               surveyType = tmp_df$surveyType,
               surveySubType = tmp_df$surveySubType,
               surveyID = tmp_df$surveyID) -> tmp_df
    }
    tmp_df
  }) -> lengths_df

glimpse(lengths_df)
## Observations: 21
## Variables: 8
## $ surveyDate <chr> "1988-07-19", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-...
## $ surveyID <chr> "107278", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "10...
## $ surveyType <chr> "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey"...
## $ surveySubType <chr> "Population Assessment", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re...
## $ fish <chr> NA, "PMK", "BLB", "LMB", "YEP", "BLG", "WTS", "WAE", "NOP", "GSF", "BLC", NA, "HSF", "PMK", "...
## $ max_length <dbl> NA, 6, 12, 16, 6, 7, 18, 18, 36, 4, 10, NA, 8, 7, 12, 12, 6, 8, 23, 38, 12
## $ min_length <dbl> NA, 3, 10, 1, 3, 3, 16, 16, 6, 4, 4, NA, 7, 4, 10, 12, 5, 3, 12, 9, 7
## $ lens <list> [NULL, <c("3", "6"), c("1", "3")>, <c("10", "11", "12"), c("1", "1", "4")>, <c("1", "16", "2...
print(lengths_df, n=nrow(lengths_df))
## # A tibble: 21 × 8
## surveyDate surveyID surveyType surveySubType fish max_length min_length lens
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <list>
## 1 1988-07-19 107278 Standard Survey Population Assessment <NA> NA NA <NULL>
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 <tibble [2 × 2]>
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 <tibble [3 × 2]>
## 4 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 <tibble [6 × 2]>
## 5 1995-07-17 107539 Standard Survey Re-Survey YEP 6 3 <tibble [3 × 2]>
## 6 1995-07-17 107539 Standard Survey Re-Survey BLG 7 3 <tibble [5 × 2]>
## 7 1995-07-17 107539 Standard Survey Re-Survey WTS 18 16 <tibble [3 × 2]>
## 8 1995-07-17 107539 Standard Survey Re-Survey WAE 18 16 <tibble [2 × 2]>
## 9 1995-07-17 107539 Standard Survey Re-Survey NOP 36 6 <tibble [17 × 2]>
## 10 1995-07-17 107539 Standard Survey Re-Survey GSF 4 4 <tibble [1 × 2]>
## 11 1995-07-17 107539 Standard Survey Re-Survey BLC 10 4 <tibble [6 × 2]>
## 12 1992-07-24 107587 Standard Survey Re-Survey <NA> NA NA <NULL>
## 13 2005-07-11 107906 Standard Survey Population Assessment HSF 8 7 <tibble [2 × 2]>
## 14 2005-07-11 107906 Standard Survey Population Assessment PMK 7 4 <tibble [4 × 2]>
## 15 2005-07-11 107906 Standard Survey Population Assessment BLB 12 10 <tibble [3 × 2]>
## 16 2005-07-11 107906 Standard Survey Population Assessment LMB 12 12 <tibble [1 × 2]>
## 17 2005-07-11 107906 Standard Survey Population Assessment YEP 6 5 <tibble [2 × 2]>
## 18 2005-07-11 107906 Standard Survey Population Assessment BLG 8 3 <tibble [6 × 2]>
## 19 2005-07-11 107906 Standard Survey Population Assessment WAE 23 12 <tibble [8 × 2]>
## 20 2005-07-11 107906 Standard Survey Population Assessment NOP 38 9 <tibble [20 × 2]>
## 21 2005-07-11 107906 Standard Survey Population Assessment BLC 12 7 <tibble [4 × 2]>
You can expand the nested catch observations this way:
filter(lengths_df, !map_lgl(lens, is.null)) %>%
  unnest(lens)
## # A tibble: 98 × 9
## surveyDate surveyID surveyType surveySubType fish max_length min_length catch_len ct
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <int>
## 1 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 3 1
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 6 3
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 10 1
## 4 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 11 1
## 5 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 12 4
## 6 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 1 1
## 7 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 16 1
## 8 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 2 6
## 9 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 4 4
## 10 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 5 2
## # ... with 88 more rows
I'm new to R and have been trying a bunch of examples, but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is a list of two tables, so you need to work on each list element, which can be done with lapply:
table <- lapply(table, function(x){
  x[x == ""] <- NA
  return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
  x[] <- lapply(x, function(y) type.convert(as.character(y), as.is = TRUE))
  x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely, but with a lot of packages:
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>
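One caveat if you try that last snippet with current packages: dmap() has since moved out of purrr (into purrrlyr), so with today's tidyverse you would express the same idea with dplyr::across(), roughly like this:
library(tidyverse)  # purrr::map, dplyr::mutate/across, readr::type_convert
library(janitor)    # clean_names

df_list <- map(table, ~ .x %>%
  clean_names() %>%
  mutate(across(everything(), as.character)) %>%
  type_convert())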