C5_rules() in Tidymodels - tidymodels

I would like to use tidymodels to fit a C5.0 rule-based classification model. I have specified the model as follows
c5_spec <-
  C5_rules() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
In the documentation for C5_rules(), I read the following:
The model is not trained or fit until the fit.model_spec() function is used with the data.
I'm not quite sure what I need to do with the parsnip model object after that. Every time I try to fit the model, I get the following error:
preprocessor 1/1, model 1/1 (predictions): Error in predict.C5.0(object = object$fit, newdata = new_data, type = "class"): either a tree or rules must be provided
What am I missing?
Thank you very much!

That's a good start! You've defined your model spec, but to fit with a workflow you'll also need to create a recipe and a workflow. Julia Silge's blog is hands down the best resource for getting used to working with tidymodels. Here's a reprex that fits a C5.0 classifier once to training data:
# load tidymodels & rules
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(rules)
#> Warning: package 'rules' was built under R version 4.1.1
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
  cars_train %>%
  mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
  recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
  C5_rules() %>%
  set_engine("C5.0")
# create workflow
cars_wf <-
  workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
  predict(cars_fit, new_data = cars_train) %>%
  bind_cols(cars_train) %>%
  select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 x 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # ... with 22 more rows
# confusion matrix
cars_preds %>%
  conf_mat(truth = cyl,
           estimate = .pred_class)
#> Warning in vec2table(truth = truth, estimate = estimate, dnn = dnn, ...): `truth`
#> was converted to a factor
#> Truth
#> Prediction 4 6 8
#> 4 11 0 0
#> 6 0 7 0
#> 8 0 0 14
Created on 2021-09-30 by the reprex package (v2.0.1)
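If you want to evaluate on held-out data rather than predicting back onto the training set, the same workflow slots straight into an initial split. A minimal sketch reusing the `cars_wf` and `cars_train` objects from above (mtcars is really too small for a meaningful split, so treat this as the pattern, not a serious evaluation):

```r
library(tidymodels)
library(rules)

set.seed(123)
# split the toy data; stratify on the outcome so each class appears in both sets
cars_split <- initial_split(cars_train, strata = cyl)

# fit on the training portion only
cars_split_fit <- fit(cars_wf, data = training(cars_split))

# predict the held-out rows
predict(cars_split_fit, new_data = testing(cars_split)) %>%
  bind_cols(testing(cars_split)) %>%
  select(.pred_class, cyl)
```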

I tried the reprex by Mark Rieke and got an error on the last command (conf_mat).
# load tidymodels & rules
library(tidymodels)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
  cars_train %>%
  mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
  recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
  C5_rules() %>%
  set_engine("C5.0")
# create workflow
cars_wf <-
  workflow() %>%
  add_recipe(cars_rec) %>%
  add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
  predict(cars_fit, new_data = cars_train) %>%
  bind_cols(cars_train) %>%
  select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 × 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # … with 22 more rows
# confusion matrix
cars_preds %>%
  conf_mat(truth = cyl,
           estimate = .pred_class)
#> Error in `yardstick_table()`:
#> ! `truth` must be a factor.
#> ℹ This is an internal error in the yardstick package, please report it to the package authors.
#> Backtrace:
#> ▆
#> 1. ├─cars_preds %>% conf_mat(truth = cyl, estimate = .pred_class)
#> 2. ├─yardstick::conf_mat(., truth = cyl, estimate = .pred_class)
#> 3. └─yardstick:::conf_mat.data.frame(., truth = cyl, estimate = .pred_class)
#> 4. └─yardstick:::yardstick_table(truth = truth, estimate = estimate, case_weights = case_weights)
#> 5. └─rlang::abort("`truth` must be a factor.", .internal = TRUE)
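More recent versions of yardstick no longer coerce `truth` to a factor for you (the older version only warned), which is what this error is saying. A minimal fix is to convert `cyl` to a factor before calling conf_mat, with the levels matched to the predictions so the table lines up:

```r
library(dplyr)
library(yardstick)

# make truth a factor with the same levels as the prediction column
cars_preds %>%
  mutate(cyl = factor(cyl, levels = levels(.pred_class))) %>%
  conf_mat(truth = cyl,
           estimate = .pred_class)
```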

Related

Robust Scaler in recipes Package

Is there a robust scaler method in the recipes package in R? In my research I could not find one.
I'm assuming you are referring to the RobustScaler from scikit-learn. You are correct that there isn't a similar step in the recipes package.
It is implemented in the extrasteps package, which you can install with
# install.packages("devtools")
devtools::install_github("EmilHvitfeldt/extrasteps")
Then you can use step_robust(), which does what you are expecting.
library(recipes)
library(extrasteps)
rec <- recipe(~., data = mtcars) %>%
  step_robust(all_predictors()) %>%
  prep()
rec %>%
  bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.244 0 -0.177 -0.156 0.244 -0.685 -0.623 0 1 0 1
#> 2 0.244 0 -0.177 -0.156 0.244 -0.437 -0.344 0 1 0 1
#> 3 0.488 -0.5 -0.430 -0.359 0.185 -0.977 0.448 1 1 0 -0.5
#> 4 0.298 0 0.301 -0.156 -0.732 -0.107 0.862 1 0 -1 -0.5
#> 5 -0.0678 0.5 0.798 0.623 -0.649 0.112 -0.344 0 0 -1 0
#> 6 -0.149 0 0.140 -0.216 -1.11 0.131 1.25 1 0 -1 -0.5
#> 7 -0.664 0.5 0.798 1.46 -0.577 0.238 -0.932 0 0 -1 1
#> 8 0.705 -0.5 -0.242 -0.731 -0.00595 -0.131 1.14 1 0 0 0
#> 9 0.488 -0.5 -0.271 -0.335 0.268 -0.170 2.59 1 0 0 0
#> 10 0 0 -0.140 0 0.268 0.112 0.294 1 0 0 1
#> # … with 22 more rows
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 15.4 robust_hS9q6
#> 2 mpg median 19.2 robust_hS9q6
#> 3 mpg higher 22.8 robust_hS9q6
#> 4 cyl lower 4 robust_hS9q6
#> 5 cyl median 6 robust_hS9q6
#> 6 cyl higher 8 robust_hS9q6
#> 7 disp lower 121. robust_hS9q6
#> 8 disp median 196. robust_hS9q6
#> 9 disp higher 326 robust_hS9q6
#> 10 hp lower 96.5 robust_hS9q6
#> # … with 23 more rows
rec <- recipe(~., data = mtcars) %>%
  step_robust(all_predictors(), range = c(0.1, 0.9)) %>%
  prep()
rec %>%
  bake(new_data = NULL)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.114 0 -0.115 -0.0732 0.171 -0.337 -0.281 0 1 0
#> 2 0.114 0 -0.115 -0.0732 0.171 -0.215 -0.155 0 1 0
#> 3 0.229 -0.5 -0.280 -0.169 0.129 -0.480 0.202 1 1 0
#> 4 0.140 0 0.196 -0.0732 -0.512 -0.0526 0.388 1 0 -0.5
#> 5 -0.0317 0.5 0.519 0.293 -0.453 0.0550 -0.155 0 0 -0.5
#> 6 -0.0698 0 0.0910 -0.101 -0.778 0.0645 0.563 1 0 -0.5
#> 7 -0.311 0.5 0.519 0.687 -0.403 0.117 -0.420 0 0 -0.5
#> 8 0.330 -0.5 -0.157 -0.344 -0.00416 -0.0645 0.514 1 0 0
#> 9 0.229 -0.5 -0.176 -0.158 0.187 -0.0837 1.16 1 0 0
#> 10 0 0 -0.0910 0 0.187 0.0550 0.132 1 0 0
#> # … with 22 more rows, and 1 more variable: carb <dbl>
tidy(rec, 1)
#> # A tibble: 33 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 mpg lower 14.3 robust_MygTA
#> 2 mpg median 19.2 robust_MygTA
#> 3 mpg higher 30.1 robust_MygTA
#> 4 cyl lower 4 robust_MygTA
#> 5 cyl median 6 robust_MygTA
#> 6 cyl higher 8 robust_MygTA
#> 7 disp lower 80.6 robust_MygTA
#> 8 disp median 196. robust_MygTA
#> 9 disp higher 396 robust_MygTA
#> 10 hp lower 66 robust_MygTA
#> # … with 23 more rows
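If you'd rather avoid installing a GitHub package, the transformation itself is easy to reproduce in base R for a single data set: center on the median and scale by the interquartile range. Note the important caveat that, unlike a prepped recipe step, this computes the statistics on whatever data you pass it, so it won't reuse training-set statistics on new data:

```r
# base-R sketch of what a robust scaler does:
# subtract the median, divide by the interquartile range
robust_scale <- function(x, probs = c(0.25, 0.75)) {
  qs <- quantile(x, probs = probs, na.rm = TRUE)
  (x - median(x, na.rm = TRUE)) / unname(qs[2] - qs[1])
}

scaled <- as.data.frame(lapply(mtcars, robust_scale))
```

After scaling, each column has median 0 and an IQR of 1.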

R and DBI dbWriteTable connection to MySQL/MariaDB only imports first row

I'm using an AWS MariaDB instance to store some data. My idea was to do all the database management with the DBI package. However, I have found that DBI only writes the first row of the data frame when I try to write a table to the db. To get the full data in, I have to use DBI::dbCreateTable followed by dbx::dbxInsert. I can't figure out why DBI is not writing the full data frame.
I have gone through this post, but the conclusion is not quite clear. This is the code/output:
con <- DBI::dbConnect(odbc::odbc(), "my_odbc", timeout = 10)
## Example 1 - doesn't work
DBI::dbWriteTable(con, "test1", mtcars)
DBI::dbReadTable(con, "test1")
row_names mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 2 - doesn't work
DBI::dbCreateTable(con, "test2", mtcars)
DBI::dbAppendTable(con, "test2", mtcars)
[1] 1
DBI::dbReadTable(con, "test2")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 3 - does work.
DBI::dbCreateTable(con, "test3", mtcars)
dbx::dbxInsert(con, "test3", mtcars)
DBI::dbReadTable(con, "test3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I had a similar issue: if you aren't careful with how you define and use your primary keys, you get exactly this behaviour. The first row is accepted because it is the first with that primary key, and the rows after it are rejected and hence don't get inserted.
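Whatever the underlying cause, a quick way to confirm that rows are being silently rejected is to compare row counts immediately after the write. A sketch only (`con` stands in for your own connection object):

```r
library(DBI)

# write the table, then verify the row count made it to the database
DBI::dbWriteTable(con, "test1", mtcars, overwrite = TRUE)

n_db  <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM test1")$n
n_loc <- nrow(mtcars)

if (n_db != n_loc) {
  warning(sprintf("only %d of %d rows were written", n_db, n_loc))
}
```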

R: how to toggle html page selection in web scraping

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the url changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my algorithm so that it can obtain the table across different Year, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the html source code, but no luck.
With rvest, you can set the value in the form and submit it. Wrapped in purrr::map_dfr to iterate and row-bind the results into a data frame:
library(rvest)
sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")
baseball <- purrr::map_dfr(
  2017:2015,
  function(y) {
    Sys.sleep(10 + runif(1)) # be polite
    form <- sess %>%
      html_node(xpath = '//form[@action="index.php"]') %>%
      html_form() %>%
      set_values(year = y)
    sess <- submit_form(sess, form)
    sess %>%
      read_html() %>%
      html_node('#TTdata') %>%
      html_table(header = TRUE)
  }
)
tibble::as_data_frame(baseball) # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>
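Note that html_session(), set_values(), and submit_form() were renamed in rvest 1.0, so the code above will error on a current rvest. An equivalent sketch with the newer API, assuming the form and table selectors on the page are unchanged:

```r
library(rvest)

# rvest >= 1.0 renames: html_session -> session, submit_form -> session_submit,
# set_values -> html_form_set
sess <- session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")

form <- sess %>%
  html_element(xpath = '//form[@action="index.php"]') %>%
  html_form() %>%
  html_form_set(year = 2016)

sess <- session_submit(sess, form)

table_2016 <- sess %>%
  read_html() %>%
  html_element("#TTdata") %>%
  html_table(header = TRUE)
```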

list of lists of matrices (from JSON) into single data.frame - purrr has problems with differing row numbers?

I'm trying to use the information contained in keyed JSON names to add context to the data contained in their nested matrices. The matrices have different numbers of rows, and some of the matrices are missing (list element NULL). I am able to extract the relevant data and retain information as list names from the hierarchy using map and at_depth from the purrr package, but I cannot find a clean way to get this into a single data.frame.
I have attempted to use purrr::transpose as exemplified here, and I've tried using tidyr::unnest as shown here, but I think their desired results and inputs differ enough from mine that they are not applicable. There seem to be too many problems with the differing row counts and/or the missing matrices. I am also new to the purrr package, so there could be something simple that I'm missing here.
Here is my own attempt which produces nearly the desired result, and I think I could modify it a bit more to remove the for loop and have another layer of some 'apply' functions, but I have the suspicion that there are better ways to go about this.
Minimal reproducible Example
#Download data (requires RCurl, jsonlite, and purrr)
library(RCurl)
library(jsonlite)
library(purrr)
json <- getURL("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi?type=lake_survey&id=69070100")
#Surveys are the relevant data
data.listed <- fromJSON(json, simplifyDataFrame=F)
surveys <- data.listed$result$surveys
#Get list of lists of matrices - fish size count data
fcounts <- map(surveys, "lengths") %>%
  at_depth(2, "fishCount") %>%
  at_depth(2, data.frame) # side note: is this a good way to convert the inner matrices to data.frames?
#top-level - list - surveys
#2nd-level - list - species in each survey
#3rd-level - data.frame - X1: measured_size, X2: counts
#use survey IDs as names for top level list
#just as species are used as names for 2nd level lists
names(fcounts) <- sapply(surveys, function(s) {return(s$surveyID)})
#This produces nearly the correct result
for (i in 1:length(fcounts)) {
  surv.id <- names(fcounts)[[i]]
  if (length(fcounts[[i]]) > 0) {
    listed.withSpecies <- lapply(names(fcounts[[i]]),
                                 function(species) cbind(fcounts[[i]][[species]], species))
    surv.fishCounts <- do.call(rbind, listed.withSpecies)
    colnames(surv.fishCounts) <- c("size", "count", "species")
    surv.fishCounts$survey.ID <- surv.id
    print(surv.fishCounts)
  }
}
This is one way to get nested data frames of the lengths counts into a big data frame:
library(httr)
library(tidyverse)
res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
           query = list(type = "lake_survey", id = "69070100"))
content(res, as = "text") %>%
  jsonlite::fromJSON(simplifyDataFrame = FALSE, flatten = FALSE) -> x
x$result$surveys %>%
  map_df(~{
    tmp_df <- flatten_df(.x[c("surveyDate", "surveyID", "surveyType", "surveySubType")])
    lens <- .x$lengths
    if (length(lens) > 0) {
      fish <- names(lens)
      data_frame(fish,
                 max_length = map_dbl(lens, "maximum_length"),
                 min_length = map_dbl(lens, "minimum_length"),
                 lens = map(lens, "fishCount") %>%
                   map(~set_names(as_data_frame(.), c("catch_len", "ct")))) %>%
        mutate(surveyDate = tmp_df$surveyDate,
               surveyType = tmp_df$surveyType,
               surveySubType = tmp_df$surveySubType,
               surveyID = tmp_df$surveyID) -> tmp_df
    }
    tmp_df
  }) -> lengths_df
glimpse(lengths_df)
## Observations: 21
## Variables: 8
## $ surveyDate <chr> "1988-07-19", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-17", "1995-07-...
## $ surveyID <chr> "107278", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "107539", "10...
## $ surveyType <chr> "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey", "Standard Survey"...
## $ surveySubType <chr> "Population Assessment", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re-Survey", "Re...
## $ fish <chr> NA, "PMK", "BLB", "LMB", "YEP", "BLG", "WTS", "WAE", "NOP", "GSF", "BLC", NA, "HSF", "PMK", "...
## $ max_length <dbl> NA, 6, 12, 16, 6, 7, 18, 18, 36, 4, 10, NA, 8, 7, 12, 12, 6, 8, 23, 38, 12
## $ min_length <dbl> NA, 3, 10, 1, 3, 3, 16, 16, 6, 4, 4, NA, 7, 4, 10, 12, 5, 3, 12, 9, 7
## $ lens <list> [NULL, <c("3", "6"), c("1", "3")>, <c("10", "11", "12"), c("1", "1", "4")>, <c("1", "16", "2...
print(lengths_df, n=nrow(lengths_df))
## # A tibble: 21 × 8
## surveyDate surveyID surveyType surveySubType fish max_length min_length lens
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <list>
## 1 1988-07-19 107278 Standard Survey Population Assessment <NA> NA NA <NULL>
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 <tibble [2 × 2]>
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 <tibble [3 × 2]>
## 4 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 <tibble [6 × 2]>
## 5 1995-07-17 107539 Standard Survey Re-Survey YEP 6 3 <tibble [3 × 2]>
## 6 1995-07-17 107539 Standard Survey Re-Survey BLG 7 3 <tibble [5 × 2]>
## 7 1995-07-17 107539 Standard Survey Re-Survey WTS 18 16 <tibble [3 × 2]>
## 8 1995-07-17 107539 Standard Survey Re-Survey WAE 18 16 <tibble [2 × 2]>
## 9 1995-07-17 107539 Standard Survey Re-Survey NOP 36 6 <tibble [17 × 2]>
## 10 1995-07-17 107539 Standard Survey Re-Survey GSF 4 4 <tibble [1 × 2]>
## 11 1995-07-17 107539 Standard Survey Re-Survey BLC 10 4 <tibble [6 × 2]>
## 12 1992-07-24 107587 Standard Survey Re-Survey <NA> NA NA <NULL>
## 13 2005-07-11 107906 Standard Survey Population Assessment HSF 8 7 <tibble [2 × 2]>
## 14 2005-07-11 107906 Standard Survey Population Assessment PMK 7 4 <tibble [4 × 2]>
## 15 2005-07-11 107906 Standard Survey Population Assessment BLB 12 10 <tibble [3 × 2]>
## 16 2005-07-11 107906 Standard Survey Population Assessment LMB 12 12 <tibble [1 × 2]>
## 17 2005-07-11 107906 Standard Survey Population Assessment YEP 6 5 <tibble [2 × 2]>
## 18 2005-07-11 107906 Standard Survey Population Assessment BLG 8 3 <tibble [6 × 2]>
## 19 2005-07-11 107906 Standard Survey Population Assessment WAE 23 12 <tibble [8 × 2]>
## 20 2005-07-11 107906 Standard Survey Population Assessment NOP 38 9 <tibble [20 × 2]>
## 21 2005-07-11 107906 Standard Survey Population Assessment BLC 12 7 <tibble [4 × 2]>
You can expand the nested catch observations this way:
filter(lengths_df, !map_lgl(lens, is.null)) %>%
  unnest(lens)
## # A tibble: 98 × 9
## surveyDate surveyID surveyType surveySubType fish max_length min_length catch_len ct
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <int>
## 1 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 3 1
## 2 1995-07-17 107539 Standard Survey Re-Survey PMK 6 3 6 3
## 3 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 10 1
## 4 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 11 1
## 5 1995-07-17 107539 Standard Survey Re-Survey BLB 12 10 12 4
## 6 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 1 1
## 7 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 16 1
## 8 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 2 6
## 9 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 4 4
## 10 1995-07-17 107539 Standard Survey Re-Survey LMB 16 1 5 2
## # ... with 88 more rows
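With tidyr >= 1.0 you can also get there with hoist() and unnest_longer(), which were designed for exactly this kind of nested-list rectangling. A rough sketch, assuming `x` is the parsed JSON list from above; note that unnest_longer() drops rows with empty `lengths` lists unless you ask it to keep them:

```r
library(tidyr)
library(dplyr)
library(tibble)

# pull the survey metadata and the nested lengths list up into columns,
# then pivot one row per fish species (names become the `fish` column)
surveys_df <- tibble(survey = x$result$surveys) %>%
  hoist(survey,
        surveyDate = "surveyDate",
        surveyID   = "surveyID",
        lengths    = "lengths") %>%
  unnest_longer(lengths, indices_to = "fish", keep_empty = TRUE)
```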

How can I replace empty cells with NA in R?

I'm new to R and have been trying a bunch of examples, but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is a list of two tables, so you need to work on each list element, which can be done using lapply:
table <- lapply(table, function(x){
  x[x == ""] <- NA
  return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
  x[] <- lapply(x, function(y) type.convert(as.character(y), as.is = TRUE))
  x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>
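As an aside, dmap() has since moved to the purrrlyr package and is deprecated; in current tidyverse code the usual idiom is across() with na_if(). A sketch of the same cleanup, assuming `table` is the list returned by readHTMLTable():

```r
library(dplyr)

# replace "" with NA in every column of every table in the list,
# then let type.convert restore numeric columns
df_list <- lapply(table, function(x) {
  x %>%
    mutate(across(everything(), ~ na_if(as.character(.x), ""))) %>%
    type.convert(as.is = TRUE)
})
```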