web scraping understat website to retrieve table failing in R

I am trying to pull a table from the website https://understat.com/league/EPL
The table I am trying to import into R is highlighted in red in the screenshot:
[screenshot of website]
Using the browser's inspect tools, I can see that the XPath to the table is:
//*[@id="league-chemp"]/table
The full XPath is:
/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table
My code is as follows:
library(rvest)
library(selectr)
library(xml2)
library(jsonlite)
library(htmltab)
library(RCurl)
library(XML)
url <- 'https://understat.com/league/EPL'
webpage <- read_html('https://understat.com/league/EPL')
xpath <- "/html/body/div[1]/div[3]/div[3]/div/div[2]/div/table/tbody"
nodes <- html_nodes(webpage, xpath = xpath)
However, the response is:
> nodes
{xml_nodeset (0)}
I've hit a dead end. I think there may be some embedded JSON and JavaScript within the main HTML body of the response that is causing issues, but it's all beyond my expertise right now.
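One way to test the embedded-JSON suspicion without driving a browser is to look for the data inside the page's script tags. The sketch below is only an illustration, written in Python on a made-up page fragment: it assumes the data is embedded as a JSON.parse('...') call whose payload uses \xNN escapes, and both the variable name teamsData and the payload are hypothetical, not taken from this thread.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source (not the real page).
sample_html = r"""
<script>
var teamsData = JSON.parse('\x7B\x22t1\x22\x3A\x7B\x22title\x22\x3A\x22Arsenal\x22\x7D\x7D');
</script>
"""


def extract_embedded_json(html, var_name):
    """Find `var <var_name> = JSON.parse('...')` and decode its escaped payload."""
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    )
    match = pattern.search(html)
    if match is None:
        return None
    # Turn the literal \xNN sequences back into the characters they stand for.
    decoded = match.group(1).encode("ascii").decode("unicode_escape")
    return json.loads(decoded)


teams = extract_embedded_json(sample_html, "teamsData")
print(teams)
```

If the real page follows the same pattern, the same regular expression and decoding steps could be applied to the downloaded HTML; if it does not, a rendered-browser approach is still needed.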

I was able to extract the table with the following code:
library(rvest)
library(RSelenium)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://understat.com/league/EPL"
remDr$navigate(url)
Sys.sleep(5)
html_Content <- remDr$getPageSource()[[1]]
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`<U+2116>` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`<U+2116>` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA

Here is another approach to consider:
library(rvest) # for read_html(), html_table() and %>%
library(RDCOMClient)
url <- "https://understat.com/league/EPL"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$Body()$innerHTML()
tables <- read_html(html_Content) %>% html_table()
tables
[[1]]
# A tibble: 20 x 12
`?` Team M W D L G GA PTS xG xGA xPTS
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr>
1 1 Arsenal 9 8 0 1 23 10 24 19.48-3.52 8.17-1.83 20.03-3.97
2 2 Manchester City 9 7 2 0 33 9 23 23.27-9.73 5.81-3.19 23.59+0.59
3 3 Tottenham 9 6 2 1 20 10 20 14.78-5.22 10.60+0.60 15.12-4.88
4 4 Chelsea 8 5 1 2 13 10 16 12.10-0.90 10.62+0.62 11.86-4.14
5 5 Manchester United 8 5 0 3 13 15 15 12.35-0.65 11.41-3.59 11.86-3.14
6 6 Newcastle United 9 3 5 1 17 9 14 18.41+1.41 12.13+3.13 15.73+1.73
7 7 Brighton 8 4 2 2 14 9 14 14.52+0.52 8.58-0.42 15.53+1.53
8 8 Bournemouth 9 3 3 3 8 20 12 5.26-2.74 15.39-4.61 6.43-5.57
9 9 Fulham 9 3 2 4 14 18 11 10.22-3.78 21.34+3.34 7.02-3.98
10 10 Liverpool 8 2 4 2 20 12 10 17.02-2.98 12.33+0.33 12.95+2.95
11 11 Brentford 9 2 4 3 16 17 10 13.28-2.72 13.00-4.00 12.78+2.78
12 12 Everton 9 2 4 3 8 9 10 10.33+2.33 14.67+5.67 9.13-0.87
13 13 West Ham 9 3 1 5 8 10 10 11.51+3.51 9.64-0.36 13.64+3.64
14 14 Leeds 8 2 3 3 11 12 9 10.45-0.55 12.28+0.28 9.73+0.73
15 15 Crystal Palace 8 2 3 3 10 12 9 9.91-0.09 13.71+1.71 8.62-0.38
16 16 Aston Villa 8 2 2 4 6 10 8 8.08+2.08 10.45+0.45 10.24+2.24
17 17 Southampton 9 2 1 6 8 17 7 9.32+1.32 13.88-3.12 9.36+2.36
18 18 Wolverhampton Wanderers 9 1 3 5 3 12 6 8.16+5.16 11.84-0.16 9.54+3.54
19 19 Leicester 9 1 1 7 15 24 4 9.06-5.94 15.12-8.88 8.00+4.00
20 20 Nottingham Forest 8 1 1 6 6 21 4 8.62+2.62 15.17-5.83 7.48+3.48
[[2]]
# A tibble: 11 x 11
`?` Player Team Apps Min G A xG xA xG90 xA90
<int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
1 1 "Erling Haaland" "Manchester City" 9 768 15 3 10.10-4.90 2.61-0.39 1.18 0.31
2 2 "Harry Kane" "Tottenham" 9 804 8 1 6.72-1.28 2.06+1.06 0.75 0.23
3 3 "Roberto Firmino" "Liverpool" 7 473 6 3 4.06-1.94 1.40-1.60 0.77 0.27
4 4 "Aleksandar Mitrovic" "Fulham" 8 666 6 0 4.53-1.47 0.38+0.38 0.61 0.05
5 5 "Ivan Toney" "Brentford" 9 810 6 2 5.24-0.76 1.55-0.45 0.58 0.17
6 6 "Phil Foden" "Manchester City" 9 678 6 4 3.37-2.63 2.49-1.51 0.45 0.33
7 7 "Gabriel Jesus" "Arsenal" 9 794 5 3 6.29+1.29 1.73-1.27 0.71 0.2
8 8 "James Maddison" "Leicester" 8 716 5 2 1.40-3.60 0.97-1.03 0.18 0.12
9 9 "Leandro Trossard" "Brighton" 8 686 5 1 3.01-1.99 0.70-0.30 0.39 0.09
10 10 "Wilfried Zaha" "Crystal Palace" 7 624 4 1 3.33-0.67 1.24+0.24 0.48 0.18
11 NA "" "" NA NA 252 183 250.82-1.18 180.23-2.77 NA NA

Related

Can you explain this function and how does this work with examples?

I can't understand this function, even after checking the tutorial website for the functions I didn't know.
Here is the code:
def print_formatted(number):
    # Width of the binary form of the largest value; every column is padded to it
    pad = number.bit_length()
    for i in range(1, number + 1):
        dec = str(i).rjust(pad)
        octs = oct(i)[2:].rjust(pad)
        hexx = hex(i)[2:].rjust(pad).upper()
        bina = bin(i)[2:].rjust(pad)
        print(f'{dec} {octs} {hexx} {bina}')
Thank you in advance!
This is the output it gave when called with 17:
    1     1     1     1
    2     2     2    10
    3     3     3    11
    4     4     4   100
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8    10     8  1000
    9    11     9  1001
   10    12     A  1010
   11    13     B  1011
   12    14     C  1100
   13    15     D  1101
   14    16     E  1110
   15    17     F  1111
   16    20    10 10000
   17    21    11 10001
So how does the code above make sure that the output does not look like this:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 A 1010
11 13 B 1011
12 14 C 1100
13 15 D 1101
14 16 E 1110
15 17 F 1111
16 20 10 10000
17 21 11 10001
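The alignment comes from two pieces of the function: bit_length() gives the number of binary digits of the largest value, and str.rjust(width) left-pads a string with spaces up to that width. Here is a small sketch of the same idea, using a hypothetical helper format_row that is not part of the original code:

```python
def format_row(i, width):
    """Return decimal, octal, hex and binary forms of i, each right-justified."""
    fields = [
        str(i),              # decimal
        oct(i)[2:],          # octal without the '0o' prefix
        hex(i)[2:].upper(),  # hexadecimal without the '0x' prefix, upper case
        bin(i)[2:],          # binary without the '0b' prefix
    ]
    return " ".join(field.rjust(width) for field in fields)


n = 17
width = n.bit_length()  # 17 is 10001 in binary, so the width is 5
rows = [format_row(i, width) for i in range(1, n + 1)]
print("\n".join(rows))
```

Because the binary form of n is the widest value that will ever be printed, padding every field to that width keeps all four columns right-aligned. Without rjust, each field would be only as wide as its own digits, which gives the ragged output shown above.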

Find if a value exists in a Google sheet on a certain column, in all the rows above the current row based on 2 criterias

I have the following scenario:
There are columns A to Z and 100 rows.
For each row, in column Z, I want to find out whether the value in column A of the current row also exists in column A of any row above.
If it exists, I would like to check whether column B of the matching rows is filled in with a value.
For all the matching rows, I would like to receive the matches as an array list rather than as rows, or at least be able to output a value like "matching"/"not matching".
This should be an array formula.
I've tried something like this, only for the first criterion, but somehow it checks only the current row:
=ARRAYFORMULA( IF(ROW(Z2:Z)>2, IF(MATCH(A2:A,$A$2:A&ROW(A2:A)-1),"matching","not matching"),"not matching"))
I check whether it's the first row (as it has headers); if it is, there surely can't be any matching data above it.
It would be great to have this as a Google Sheets formula, but if that's not possible, a Google Apps Script would also work.
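The requirement above can be sketched independently of Sheets. The Python sketch below is only an illustration under one reading of the question: a row counts as "matching" when any earlier row has the same column A value and a non-empty column B. The function name flag_rows and the sample rows are hypothetical.

```python
def flag_rows(rows):
    """Flag each row whose column-A value appears in an earlier row with B filled in."""
    seen_with_b = set()  # column-A values from earlier rows whose column B is non-empty
    flags = []
    for row in rows:
        a_value, b_value = row[0], row[1]
        flags.append("matching" if a_value in seen_with_b else "not matching")
        # Record the current row only after it has been judged,
        # so that a row never matches itself.
        if b_value != "":
            seen_with_b.add(a_value)
    return flags


rows = [
    ["x", "filled"],
    ["y", ""],
    ["x", ""],
    ["y", "filled"],
    ["y", ""],
]
flags = flag_rows(rows)
print(flags)
```

The key point is that the lookup range grows row by row: each row is compared only against the rows before it, which is what the attempted formula does not do.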

Try this:
function myfunk() {
  const ss = SpreadsheetApp.getActive();
  const sh = ss.getSheetByName("Sheet0");
  const osh = ss.getSheetByName("Sheet1");
  osh.clearContents();
  const dsr = 2;
  const vs = sh.getRange(dsr, 1, sh.getLastRow() - dsr + 1, sh.getLastColumn()).getDisplayValues();
  let o = [];
  vs.forEach((r, i) => {
    if (i > 0) {
      let as = vs.map(r => r[0]).slice(0, i); // suggested by DoubleUnary
      let bs = vs.map(r => r[1]).slice(0, i); // suggested by DoubleUnary
      let idx = as.indexOf(r[25]);
      if (~idx && bs[idx]) {
        o.push(['yes', dsr + i, dsr + idx, r[25], bs[idx]]);
      } else {
        o.push(['no', dsr + i, ~idx ? as[idx] : '', r[25], ~idx ? bs[idx] : '']);
      }
    }
  });
  o.unshift(['Value', 'Test Row', 'Result Row', 'Z value', 'B value']);
  Logger.log(JSON.stringify(o));
  osh.getRange(1, 1, o.length, o[0].length).setValues(o);
}
My Data:
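COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 | COL11 | COL12 | COL13 | COL14 | COL15 | COL16 | COL17 | COL18 | COL19 | COL20 | COL21 | COL22 | COL23 | COL24 | COL25 | COL26
4 | 4 | 8 | 18 | 3 | 15 | 15 | 6 | 6 | 18 | 2 | 10 | 19 | 14 | 5 | 16 | 3 | 6 | 0 | 13 | 15 | 14 | 10 | 13 | 19 | 7
14 | 5 | 18 | 12 | 12 | 3 | 5 | 5 | 12 | 0 | 0 | 4 | 19 | 17 | 13 | 14 | 2 | 6 | 2 | 0 | 18 | 15 | 16 | 1 | 1 | 15
14 | 8 | 18 | 19 | 18 | 19 | 14 | 11 | 9 | 2 | 12 | 4 | 19 | 8 | 7 | 17 | 2 | 5 | 17 | 12 | 3 | 18 | 6 | 15 | 12 | 17
12 | 15 | 1 | 11 | 2 | 14 | 4 | 12 | 15 | 4 | 2 | 7 | 13 | 12 | 4 | 10 | 0 | 2 | 9 | 2 | 15 | 12 | 18 | 7 | 10 | 6
15 | 8 | 3 | 11 | 3 | 11 | 8 | 2 | 0 | 12 | 18 | 12 | 17 | 3 | 3 | 10 | 5 | 18 | 0 | 6 | 19 | 12 | 11 | 2 | 3 | 5
16 | 16 | 7 | 14 | 12 | 3 | 1 | 9 | 0 | 1 | 9 | 4 | 17 | 11 | 18 | 2 | 4 | 16 | 13 | 4 | 1 | 3 | 4 | 13 | 9 | 8
11 | 18 | 9 | 9 | 10 | 17 | 6 | 16 | 8 | 10 | 15 | 10 | 18 | 1 | 2 | 9 | 10 | 18 | 13 | 0 | 11 | 4 | 7 | 2 | 0 | 18
3 | 5 | 1 | 5 | 18 | 17 | 4 | 8 | 2 | 4 | 10 | 13 | 7 | 10 | 9 | 6 | 3 | 7 | 5 | 7 | 12 | 12 | 6 | 0 | 3 | 7
3 | 3 | 19 | 4 | 2 | 5 | 0 | 9 | 5 | 14 | 0 | 2 | 15 | 9 | 18 | 6 | 1 | 15 | 5 | 5 | 1 | 12 | 4 | 7 | 9 | 3
19 | 19 | 15 | 16 | 12 | 18 | 13 | 0 | 12 | 4 | 12 | 4 | 1 | 8 | 19 | 2 | 1 | 1 | 8 | 14 | 6 | 10 | 0 | 16 | 14 | 14
10 | 8 | 3 | 15 | 5 | 13 | 9 | 13 | 10 | 6 | 16 | 2 | 15 | 3 | 2 | 16 | 19 | 2 | 14 | 1 | 10 | 1 | 1 | 5 | 5 | 10
8 | | 3 | 8 | 17 | 13 | 15 | 8 | 9 | 6 | 4 | 2 | 14 | 6 | 4 | 1 | 6 | 14 | 8 | 9 | 11 | 12 | 3 | 18 | 5 | 14
9 | 18 | 2 | 12 | 17 | 2 | 17 | 10 | 0 | 11 | 7 | 11 | 2 | 0 | 11 | 15 | 6 | 7 | 13 | 10 | 18 | 17 | 6 | 19 | 12 | 14
15 | 7 | 12 | 5 | 0 | 17 | 15 | 2 | 2 | 18 | 6 | 7 | 13 | 1 | 10 | 19 | 9 | 7 | 13 | 15 | 13 | 7 | 18 | 11 | 13 | 10
8 | 1 | 10 | 5 | 17 | 9 | 9 | 5 | 14 | 3 | 3 | 1 | 19 | 7 | 13 | 0 | 5 | 10 | 2 | 12 | 17 | 3 | 12 | 9 | 0 | 10
9 | 15 | 6 | 14 | 18 | 1 | 3 | 6 | 4 | 9 | 19 | 4 | 9 | 15 | 11 | 0 | 3 | 10 | 19 | 5 | 18 | 16 | 10 | 4 | 4 | 4
1 | 1 | 6 | 8 | 10 | 9 | 8 | 19 | 4 | 11 | 18 | 12 | 14 | 8 | 4 | 5 | 11 | 8 | 17 | 5 | 7 | 13 | 13 | 16 | 14 | 8
7 | 14 | 7 | 18 | 9 | 3 | 11 | 0 | 1 | 7 | 19 | 8 | 6 | 3 | 4 | 4 | 2 | 4 | 11 | 3 | 7 | 5 | 5 | 9 | 16 | 15
7 | 6 | 4 | 6 | 7 | 17 | 8 | 13 | 10 | 2 | 9 | 18 | 0 | 13 | 12 | 4 | 13 | 9 | 4 | 19 | 4 | 7 | 10 | 17 | 1 | 5
5 | 3 | 7 | 12 | 3 | 19 | 19 | 7 | 1 | 11 | 9 | 9 | 9 | 7 | 5 | 6 | 8 | 7 | 0 | 11 | 19 | 6 | 17 | 12 | 1 | 18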
Results:
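Value | Test Row | Result Row | Z value | B value
no | 3 | | 15 |
no | 4 | | 17 |
no | 5 | | 6 |
no | 6 | | 5 |
no | 7 | | 8 |
no | 8 | | 18 |
no | 9 | | 7 |
yes | 10 | 9 | 3 | 5
yes | 11 | 3 | 14 | 5
no | 12 | | 10 |
yes | 13 | 3 | 14 | 5
yes | 14 | 3 | 14 | 5
yes | 15 | 12 | 10 | 8
yes | 16 | 12 | 10 | 8
yes | 17 | 2 | 4 | 4
no | 18 | 8 | 8 |
yes | 19 | 6 | 15 | 8
no | 20 | | 5 |
no | 21 | | 18 |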

Webscraping Pokemon Data

I am trying to find out the number of moves each Pokemon (first generation) could learn.
I found the following website that contains this information: https://pokemondb.net/pokedex/game/red-blue-yellow
There are 151 Pokemon listed here - and for each of them, their move set is listed on a template page like this: https://pokemondb.net/pokedex/bulbasaur/moves/1
Since I am using R, I tried to get the website addresses for each of these 150 Pokemon (https://docs.google.com/document/d/1fH_n_BPbIk1bZCrK1hLAJrYPH2d5RTy9IgdR5Ck_lNw/edit#):
names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
"Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
"Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
"Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
"Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
"Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
"Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")
template_1 = rep("https://pokemondb.net/pokedex/",150)
template_2 = rep("/moves/1",150)
pokemon_websites = data.frame(template_1, names, template_2)
pokemon_websites$full_website = paste(pokemon_websites$template_1, pokemon_websites$names, pokemon_websites$template_2)
Next, I remove all spaces:
library(stringr)
pokemon_websites$full_website = str_remove_all( pokemon_websites$full_website," ")
Now, I have a column with all the website names:
head(pokemon_websites)
template_1 names template_2 full_website
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1
I would like to count the number of moves each of these 150 Pokemon can learn. For example, the first Pokemon "Bulbasaur" can learn 24 moves:
In the end, I would like to add a column to the earlier data frame that contains the number of moves each Pokemon can learn. For example, something that looks like this:
> head(pokemon_websites)
template_1 names template_2 full_website number_of_moves
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1 24
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1 ???
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1 ???
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1 ???
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1 ???
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1 ???
Is there a way to webscrape this data in R, count the number of moves for each of the 150 Pokemon, and then place this move count into a column?
Right now I am doing this by hand and it is taking a long time! Also, I have heard some websites do not allow for automated webscraping - if this website (https://pokemondb.net/pokedex/game/red-blue-yellow) does not allow webscraping, I can try to find another website that might allow it.
Thank you!

You can scrape all the tables for each of the Pokemon using something like this:
library(rvest) # for read_html(), html_nodes(), html_table() and %>%
tables = lapply(pokemon_websites$full_website, function(link) {
  tryCatch(
    read_html(link) %>% html_nodes("table") %>% html_table(),
    error = function(e) {}, warning = function(w) {}
  )
})
However, note that the number of tables returned differs for each of the pokemon. For example the first has 6 tables - the first three of those are for Red/Blue, the second three of those are for Yellow.
lengths(tables)
[1] 6 6 6 6 6 6 6 6 6 2 4 7 2 4 8 6 6 6 4 4 6 6 6 6 6 8 6 6 0 4 8 4 8 6 8 4 6 6 8 4 4 6 6 8 6 6 5 5 5 5 4 4 6 6 6
[56] 6 4 6 6 6 8 6 6 6 6 6 6 6 6 8 6 6 6 6 6 4 4 6 6 6 6 0 6 6 6 6 4 4 6 8 4 4 6 6 6 6 6 6 6 6 4 8 6 7 6 6 6 4 4 6
[111] 6 6 6 6 6 6 6 6 6 8 0 6 4 6 6 6 6 2 8 6 2 4 8 8 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Since the OP wants to count only the moves in the Red/Blue tab, we can do the following (if you need the moves from both tabs, follow @langtang's answer):
library(rvest) # for read_html(), html_nodes(), html_table() and %>%
tables1 = lapply(pokemon_websites$full_website, function(x) {
  tryCatch(
    x %>% read_html() %>% html_nodes('.active') %>% html_nodes('.resp-scroll') %>% html_table(),
    error = function(e) NULL
  )
})
# Number of rows in each table of the active tab, summed per Pokemon
moves = lapply(tables1, function(x) lapply(x, function(x) dim(x)[1]))
moves = lapply(moves, unlist, use.names = FALSE)
moves = lapply(moves, sum) %>% unlist()
[1] 24 25 27 32 33 37 32 33 37 2 3 30 2 3 26 22 23 25 24 27 21 23 24 26 29 30 27 29 0 28 43 28 44 41 42 22 23 40 41 19 22 21 23 25 23 26 22 29 20 23 24
[52] 27 31 34 31 34 23 25 25 36 37 25 34 35 29 31 32 23 24 26 28 31 30 31 33 22 26 37 46 23 26 0 23 26 25 28 22 24 27 29 20 20 32 25 33 36 25 27 24 27 24 29
[103] 32 35 26 26 37 19 21 27 42 44 24 36 22 24 25 27 33 34 0 21 34 35 30 23 27 2 34 34 1 19 32 32 28 30 22 28 22 30 24 43 25 25 22 30 32 36 45 60

R and DBI dbWriteTable connection to MySQL/MariaDB only imports first row

I'm using an AWS mariaDB to store some data. My idea was to do the full management with the DBI package. However, I have found that DBI only imports the first row of the data when I try to write a table in the db. I have to use DBI::dbCreateTable and dbx::dbxInsert. I can't figure out why DBI is not importing the full data frame.
I have gone through this post but the conclusion is not quite clear. This is the code/output:
con <- DBI::dbConnect(odbc::odbc(), "my_odbc", timeout = 10)
## Example 1 - doesn't work
DBI::dbWriteTable(con, "test1", mtcars)
DBI::dbReadTable(con, "test1")
row_names mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 2 - doesn't work
DBI::dbCreateTable(con, "test2", mtcars)
DBI::dbAppendTable(con, "test2", mtcars)
[1] 1
DBI::dbReadTable(con, "test2")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
# Example 3 - does work.
DBI::dbCreateTable(con, "test3", mtcars)
dbx::dbxInsert(con, "test3", mtcars)
DBI::dbReadTable(con, "test3")
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

I had a similar issue: if you aren't careful with how you define and use your primary keys, you can get this behaviour. The first row is accepted because it is the first one with that primary key; the rows after it are blocked and hence don't get inserted.
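The mechanism described here can be reproduced in miniature. The sketch below uses Python with an in-memory SQLite database rather than R and MariaDB, so it only illustrates the general behaviour; the table cars, its columns, and the use of INSERT OR IGNORE are assumptions made for the example, not details from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table whose primary key is a column with repeated values.
conn.execute("CREATE TABLE cars (gear INTEGER PRIMARY KEY, mpg REAL)")

rows = [(4, 21.0), (4, 21.0), (4, 22.8)]

# OR IGNORE silently skips rows that would violate the primary key,
# so only the first row with gear = 4 is stored.
conn.executemany("INSERT OR IGNORE INTO cars (gear, mpg) VALUES (?, ?)", rows)

stored = conn.execute("SELECT gear, mpg FROM cars").fetchall()
print(stored)
conn.close()
```

Whether this is what happens in the question depends on how the target table and the ODBC driver handle keys, which the thread does not show.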

How can I replace empty cells with NA in R?

I'm new to R, and have been trying a bunch of examples but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.

The result you get from readHTMLTable is a list of two tables, so you need to work on each list element, which can be done using lapply:
table <- lapply(table, function(x){
  x[x == ""] <- NA
  return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>

You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
  x[] <- lapply(x, function(y){type.convert(as.character(y), as.is = TRUE)})
  x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr::map and readr::type_convert
library(purrrlyr) # for dmap, which moved here from purrr as of purrr 0.2.3
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>
