How can I replace empty cells with NA in R? - html

I'm new to R, and have been trying a bunch of examples but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.

The result you get from readHTMLTable is giving you a list of two tables, so you need to work on each list element, which can be done using lapply
table <- lapply(table, function(x){
x[x == ""] <- NA
return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>

You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
x[] <- lapply(x, function(y){type.convert(as.character(y), as.is = TRUE)});
x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>

Related

C5_rules() in Tidymodels

I would like to use tidymodels to fit a C5.0 rule-based classification model. I have specified the model as follows
c5_spec <-
C5_rules() %>%
set_engine("C5.0") %>%
set_mode("classification")
In the documentation for the C5_rules() command, I read the following.
The model is not trained or fit until the fit.model_spec() function is used with the data.
I'm not quite sure what I need to do with the parsnip model object after that. Every time I try to fit the model, I get the following error
preprocessor 1/1, model 1/1 (predictions): Error in predict.C5.0(object = object$fit, newdata = new_data, type = "class"): either a tree or rules must be provided
What am I missing?
Thank you very much!
That's a good start! You've defined your model spec, but if you're wanting to fit using a workflow, you'll need to create a recipe & workflow as well. Julia Silge's blog is hands down the best resource for getting used to working with tidymodels. Here's a reprex that fits a C5 classifier once to training data:
# load tidymodels & rules
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(rules)
#> Warning: package 'rules' was built under R version 4.1.1
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
cars_train %>%
mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
C5_rules() %>%
set_engine("C5.0")
# create workflow
cars_wf <-
workflow() %>%
add_recipe(cars_rec) %>%
add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
predict(cars_fit, new_data = cars_train) %>%
bind_cols(cars_train) %>%
select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 x 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # ... with 22 more rows
# confusion matrix
cars_preds %>%
conf_mat(truth = cyl,
estimate = .pred_class)
#> Warning in vec2table(truth = truth, estimate = estimate, dnn = dnn, ...): `truth`
#> was converted to a factor
#> Truth
#> Prediction 4 6 8
#> 4 11 0 0
#> 6 0 7 0
#> 8 0 0 14
Created on 2021-09-30 by the reprex package (v2.0.1)
I tried reprex by Mark Rieke and I got an error for the last command (conf_mat).
load tidymodels & rules
library(tidymodels)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
# example training dataset
cars_train <- as_tibble(mtcars)
# change the number of cylinders to character for predicting as a class
cars_train <-
cars_train %>%
mutate(cyl = as.character(cyl))
# training df
cars_train
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
# setup recipe with no preprocessing
cars_rec <-
recipe(cyl ~ ., data = cars_train)
# specify c5 model; no need to set mode (can only be used for classification)
cars_spec <-
C5_rules() %>%
set_engine("C5.0")
# create workflow
cars_wf <-
workflow() %>%
add_recipe(cars_rec) %>%
add_model(cars_spec)
# fit workflow
cars_fit <- fit(cars_wf, data = cars_train)
# add predictions to df
cars_preds <-
predict(cars_fit, new_data = cars_train) %>%
bind_cols(cars_train) %>%
select(.pred_class, cyl)
cars_preds
#> # A tibble: 32 × 2
#> .pred_class cyl
#> <fct> <chr>
#> 1 6 6
#> 2 6 6
#> 3 4 4
#> 4 6 6
#> 5 8 8
#> 6 6 6
#> 7 8 8
#> 8 4 4
#> 9 4 4
#> 10 6 6
#> # … with 22 more rows
# confusion matrix
cars_preds %>%
conf_mat(truth = cyl,
estimate = .pred_class)
#> Error in `yardstick_table()`:
#> ! `truth` must be a factor.
#> ℹ This is an internal error in the yardstick package, please report it to the package authors.
#> Backtrace:
#> ▆
#> 1. ├─cars_preds %>% conf_mat(truth = cyl, estimate = .pred_class)
#> 2. ├─yardstick::conf_mat(., truth = cyl, estimate = .pred_class)
#> 3. └─yardstick:::conf_mat.data.frame(., truth = cyl, estimate = .pred_class)
#> 4. └─yardstick:::yardstick_table(truth = truth, estimate = estimate, case_weights = case_weights)
#> 5. └─rlang::abort("`truth` must be a factor.", .internal = TRUE)

Problem with retrieving a table from Chinook dataset due to character

I have created a database in MySQL with data from the Chinook dataset, which has fictitious information on customers that buy music.
One of the tables ("Invoice"), has the billing addresses, which has characters in diverse languages:
InvoiceId CustomerId InvoiceDate BillingAddress
1 2 2009-01-01 00:00:00 Theodor-Heuss-Straße 34
2 4 2009-01-02 00:00:00 Ullevålsveien 14
3 8 2009-01-03 00:00:00 Grétrystraat 63
4 14 2009-01-06 00:00:00 8210 111 ST NW
I tried to retrieve the data using R, with the following code:
library(DBI)
library(RMySQL)
library(dplyr)
library(magrittr)
library(lubridate)
library(stringi)
# Step 1 - Connect to the database ----------------------------------------
con <- DBI::dbConnect(MySQL(),
dbname = Sys.getenv("DB_CHINOOK"),
host = Sys.getenv("HST_CHINOOK"),
user = Sys.getenv("USR_CHINOOK"),
password = Sys.getenv("PASS_CHINOOK"),
port = XXXX)
invoices_tbl <- tbl(con, "Invoice") %>%
collect()
The connection is ok, but when trying to visualize the data, I can't see the special characters:
> head(invoices_tbl[,1:4])
# A tibble: 6 x 4
InvoiceId CustomerId InvoiceDate BillingAddress
<int> <int> <chr> <chr>
1 1 2 2009-01-01 00:00:00 "Theodor-Heuss-Stra\xdfe 34"
2 2 4 2009-01-02 00:00:00 "Ullev\xe5lsveien 14"
3 3 8 2009-01-03 00:00:00 "Gr\xe9trystraat 63"
4 4 14 2009-01-06 00:00:00 "8210 111 ST NW"
5 5 23 2009-01-11 00:00:00 "69 Salem Street"
6 6 37 2009-01-19 00:00:00 "Berger Stra\xdfe 10"
My question is, should I change something in the configuration inside MySQL? Or is it an issue with R? How can I see the special characters? What is the meaning of \xdfe?
Please, any help will be greatly appreciated.
The hexadecimal format can be converted with iconv
invoices_tbl$BillingAddress <- iconv(invoices_tbl$BillingAddress,
"latin1", "utf-8")
-output
invoices_tbl
InvoiceId CustomerId InvoiceDate BillingAddress
1 1 2 2009-01-01 00:00:00 Theodor-Heuss-Straße 34
2 2 4 2009-01-02 00:00:00 Ullevålsveien 14
3 3 8 2009-01-03 00:00:00 Grétrystraat 63
4 4 14 2009-01-06 00:00:00 8210 111 ST NW
5 5 23 2009-01-11 00:00:00 69 Salem Street
6 6 37 2009-01-19 00:00:00 Berger Straße 10
data
invoices_tbl <- structure(list(InvoiceId = 1:6, CustomerId = c(2L, 4L, 8L, 14L,
23L, 37L), InvoiceDate = c("2009-01-01 00:00:00", "2009-01-02 00:00:00",
"2009-01-03 00:00:00", "2009-01-06 00:00:00", "2009-01-11 00:00:00",
"2009-01-19 00:00:00"), BillingAddress = c("Theodor-Heuss-Stra\xdfe 34",
"Ullev\xe5lsveien 14", "Gr\xe9trystraat 63", "8210 111 ST NW",
"69 Salem Street", "Berger Stra\xdfe 10")), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")

web scraping understat website to retrieve table failing in R

I am trying to to pull out a table from the the website https://understat.com/league/EPL
The table I am trying to import into R is highlighted in red in the screenshot here;
screenshot of website
Using inspect tools I can see the xpath to the table as follows;
//*[#id="league-chemp"]/table
full XPath is
/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table
My code is as follows;
library(rvest)
library(selectr)
library(xml2)
library(jsonlite)
library(htmltab)
library(RCurl)
library(XML)
url <- 'https://understat.com/league/EPL'
webpage <- read_html('https://understat.com/league/EPL')
xpath <- "/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table/tbody"
nodes <- html_nodes(webpage, xpath = xpath)
However the response is;
> nodes
{xml_nodeset (0)}
I've hit a dead end, I think there maybe some embedded JSON code and javascript within the main html body of the response that is causing issues, but its all above my expertise right now.
I have been able to extract the table with the following code :
library(rvest)
library(RSelenium)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://understat.com/league/EPL"
remDr$navigate(url)
Sys.sleep(5)
html_Content <- remDr$getPageSource()[[1]]
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`<U+2116>` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`<U+2116>` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA
Here is another approach that can be considered :
library(RDCOMClient)
url <- "https://understat.com/league/EPL"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$Body()$innerHTML()
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`?` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`?` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA

R and DBI dbWriteTable connection to MySQL/MariaDB only imports first row

I'm using an AWS mariaDB to store some data. My idea was to do the full management with the DBI package. However, I have found that DBI only imports the first row of the data when I try to write a table in the db. I have to use DBI::dbCreateTable and dbx::dbxInsert. I can't figure out why DBI is not importing the full data frame.
I have gone through this post but the conclusion is not quite clear. This is the code/output:
con <- DBI::dbConnect(odbc::odbc(), "my_odbc", timeout = 10)
## Example 1 - doesn't work
DBI::dbWriteTable(con, "test1", mtcars)
DBI::dbReadTable(con, "test1")
row_names mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 2 - doesn't work
DBI::dbCreateTable(con, "test2", mtcars)
DBI::dbAppendTable(con, "test2", mtcars)
[1] 1
DBI::dbReadTable(con, "test2")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 3 - does work.
DBI::dbCreateTable(con, "test3", mtcars)
dbx::dbxInsert(con, "test3", mtcars)
DBI::dbReadTable(con, "test3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I had a similar issue and if you aren't careful with how you define and use your primary keys you get this issue. The first row is allowed as its the first with that primary key and then the rows after are blocked and hence dont get inserted.

R: how to toggle html page selection in web scraping

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
I'm trying to scrape the 2016 table data from the above webpage. If I change the Year to 2010, the url changes to http://legacy.baseballprospectus.com/sortable/index.php?cid=1966487.
I want to automate my algorithm so that it can obtain the table across different Year, but I'm not sure how I can obtain the unique identifiers (e.g. 1966487) for each page automatically. Is there a way to find the list of these?
I've tried looking at the html source code, but no luck.
With rvest, you can set the value in the form and submit it. Wrapped in purrr::map_dfr to iterate and row-bind the results in to a data frame,
library(rvest)
sess <- html_session("http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181")
baseball <- purrr::map_dfr(
2017:2015,
function(y){
Sys.sleep(10 + runif(1)) # be polite
form <- sess %>%
html_node(xpath = '//form[#action="index.php"]') %>%
html_form() %>%
set_values(year = y)
sess <- submit_form(sess, form)
sess %>%
read_html() %>%
html_node('#TTdata') %>%
html_table(header = TRUE)
}
)
tibble::as_data_frame(baseball) # for printing
#> # A tibble: 4,036 x 38
#> `#` NAME TEAM LG YEAR AGE G PA AB R
#> <dbl> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int>
#> 1 1 Giancarlo Stanton MIA NL 2017 27 159 692 597 123
#> 2 2 Joey Votto CIN NL 2017 33 162 707 559 106
#> 3 3 Charlie Blackmon COL NL 2017 30 159 725 644 137
#> 4 4 Aaron Judge NYA AL 2017 25 155 678 542 128
#> 5 5 Nolan Arenado COL NL 2017 26 159 680 606 100
#> 6 6 Kris Bryant CHN NL 2017 25 151 665 549 111
#> 7 7 Mike Trout ANA AL 2017 25 114 507 402 92
#> 8 8 Jose Altuve HOU AL 2017 27 153 662 590 112
#> 9 9 Paul Goldschmidt ARI NL 2017 29 155 665 558 117
#> 10 10 Jose Ramirez CLE AL 2017 24 152 645 585 107
#> # ... with 4,026 more rows, and 28 more variables: H <int>, `1B` <int>,
#> # `2B` <int>, `3B` <int>, HR <int>, TB <int>, BB <int>, IBB <int>,
#> # SO <int>, HBP <int>, SF <int>, SH <int>, RBI <int>, DP <int>,
#> # NETDP <dbl>, SB <int>, CS <int>, AVG <dbl>, OBP <dbl>, SLG <dbl>,
#> # OPS <dbl>, ISO <dbl>, BPF <int>, oppOPS <dbl>, TAv <dbl>, VORP <dbl>,
#> # FRAA <dbl>, BWARP <dbl>